From wes.turner at gmail.com Mon Jan 2 01:10:45 2017
From: wes.turner at gmail.com (Wes Turner)
Date: Mon, 2 Jan 2017 00:10:45 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

On Wednesday, December 28, 2016, Wes Turner wrote:

> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha <omaha at python.org> wrote:
>
>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement in our
>> score! Currently sitting at 798th place.
>
> Nice work! Features of your feature engineering I admire:
>
> - nominal, ordinal, continuous, discrete
>   categorical = nominal + discrete
>   numeric = continuous + discrete
>
> - outlier removal
>   - [ ] w/ constant thresholding? (is there a distribution parameter?)
>
> - building datestrings from SaleMonth and YrSold
>   - SaleMonth / "1" / YrSold
>   - df.drop(['MoSold', 'YrSold', 'SaleMonth'])
>   - [ ] why drop SaleMonth?
>   - [ ] pandas.to_datetime(df['SaleMonth'])
>
> - merging with the FHA Home Price Index for the month and region
>   ("West North Central")
>   https://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_PO_monthly_hist.xls
>   - [ ] pandas.to_datetime
>   - this should have every month, but the new merge_asof feature is
>     worth mentioning (see the sketch after this list)
>
> - manual binarization
>   - [ ] how did you pick these? correlation after pd.get_dummies?
>   - [ ] why floats? 1.0 / 1 (does it make a difference?)
>
> - Ames, IA nbrhood_multiplier
>   - http://www.cityofames.org/home/showdocument?id=1024
>
> - feature merging
>   - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2
>   - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath + (HalfBath / 2.0)
>   - ( ) IDK how a feature-selection pipeline could do this automatically
>
> - null value imputation
>   - .isnull() -> 0
>   - ( ) datacleaner incorrectly sets these to median or mode
>
> - log for skewed continuous features and SalePrice
>   - ( ) auto_ml: take_log_of_y does this for SalePrice
>
> - "Keeping only the columns we want"
>   - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, index_col='Id'))
>
> - Binarization
>   - pd.get_dummies(dummy_na=False)
>   - [ ] (as Luke pointed out, concatenating train and test first keeps
>     the same columns in both)
>     rows = eng_train.shape[0]
>     eng_merged = pd.concat([eng_train, eng_test])
>     onehot_merged = pd.get_dummies(eng_merged, columns=nominal, dummy_na=False)
>     onehot_train = onehot_merged[:rows]
>     onehot_test = onehot_merged[rows:]
>
> - class RandomSelectionHelper
>   - [ ] this could be generally helpful in sklearn[-pandas]
>   - https://github.com/paulgb/sklearn-pandas#cross-validation
>
> - Models to Search
>   - {Ridge, Lasso, ElasticNet}
>   - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L222
>     _get_estimator_names ("regressor")
>     - {XGBRegressor, GradientBoostingRegressor, RANSACRegressor,
>       RandomForestRegressor, LinearRegression, AdaBoostRegressor,
>       ExtraTreesRegressor}
>   - https://github.com/ClimbsRocks/auto_ml/blob/master/auto_ml/predictor.py#L491
>     (w/ ensembling)
>     - ['RandomForestRegressor', 'LinearRegression', 'ExtraTreesRegressor',
>       'Ridge', 'GradientBoostingRegressor', 'AdaBoostRegressor', 'Lasso',
>       'ElasticNet', 'LassoLars', 'OrthogonalMatchingPursuit',
>       'BayesianRidge', 'SGDRegressor'] + ['XGBRegressor']
>
> - model stacking / ensembling
>   - ( ) auto_ml: https://auto-ml.readthedocs.io/en/latest/ensembling.html
>   - ( ) auto-sklearn:
>     https://automl.github.io/auto-sklearn/stable/api.html#autosklearn.regression.AutoSklearnRegressor
>     ensemble_size=50, ensemble_nbest=50
>   - https://en.wikipedia.org/wiki/Ensemble_learning
>   - http://www.scholarpedia.org/article/Ensemble_learning#Ensemble_combination_rules
>
> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: np.exp(x))
>   - [ ] What is this called / how does this work? (np.exp is the inverse
>     of np.log, so this back-transforms the log-transformed SalePrice)
>   - https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html
>
> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also works
>   - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
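A minimal sketch of that merge_asof idea, assuming the HPI spreadsheet has
already been loaded into a monthly DataFrame; the column names here
(SaleDate, Month, HPI) are hypothetical, and the values are made up:

    import pandas as pd

    # Hypothetical sales frame with a datetime built from YrSold/MoSold,
    # and a monthly HPI frame; both must be sorted on their join keys.
    sales = pd.DataFrame({
        'SaleDate': pd.to_datetime(['2007-06-01', '2008-03-01']),
        'SalePrice': [181500, 208500],
    }).sort_values('SaleDate')
    hpi = pd.DataFrame({'Month': pd.date_range('2006-01', '2010-12', freq='MS')})
    hpi['HPI'] = range(len(hpi))  # placeholder index values

    # For each sale, take the most recent HPI row at or before SaleDate
    merged = pd.merge_asof(sales, hpi, left_on='SaleDate', right_on='Month')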
>> My notebook is on GitHub for those interested:
>>
>> https://github.com/jeremy-doyle/home_price_kaggle/tree/master/attempt_4
>
> Thanks!

(Trimmed for 40K limit)

From bob.haffner at gmail.com Tue Jan 3 20:51:05 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Tue, 3 Jan 2017 19:51:05 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID:

Pretty interesting notebook I put together regarding the kaggle comp:
https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb

On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha <omaha at python.org> wrote:

> [quoted text trimmed]
From uiab1638 at yahoo.com Tue Jan 3 23:41:17 2017
From: uiab1638 at yahoo.com (Jeremy Doyle)
Date: Tue, 3 Jan 2017 22:41:17 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
Message-ID: <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>

Looks like we have our key to a score of 0.0.
Lol

Seriously though, does anyone wonder if the person sitting at #1 had this
full data set as well and trained a model using the entire set? I mean,
that 0.038 score is so much better than anyone else's that it seems a
little unrealistic... or maybe it just seems that way because I haven't
been able to break through 0.12 : )

Sent from my iPhone

> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>
> Pretty interesting notebook I put together regarding the kaggle comp:
> https://github.com/bobhaffner/kaggle-houseprices/blob/master/additional_training_data.ipynb
>
> [remainder of quote trimmed]
From wes.turner at gmail.com Wed Jan 4 00:49:34 2017
From: wes.turner at gmail.com (Wes Turner)
Date: Tue, 3 Jan 2017 23:49:34 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To: <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

https://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions

(I haven't looked.)

On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <omaha at python.org> wrote:

> Looks like we have our key to a score of 0.0. Lol
>
> [remainder of quote trimmed]
From wes.turner at gmail.com Wed Jan 4 00:50:09 2017
From: wes.turner at gmail.com (Wes Turner)
Date: Tue, 3 Jan 2017 23:50:09 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

... https://en.wikipedia.org/wiki/Regression_(psychology)

On Tue, Jan 3, 2017 at 11:49 PM, Wes Turner <wes.turner at gmail.com> wrote:

> https://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions
>
> (I haven't looked.)
>
> [remainder of quote trimmed]
From wes.turner at gmail.com Wed Jan 4 00:54:27 2017
From: wes.turner at gmail.com (Wes Turner)
Date: Tue, 3 Jan 2017 23:54:27 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

SciPy also has tools for statistical distributions:
https://docs.scipy.org/doc/scipy/reference/stats.html#multivariate-distributions

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rayleigh.html#scipy.stats.rayleigh
- https://en.wikipedia.org/wiki/Rayleigh_scattering

On Tue, Jan 3, 2017 at 11:49 PM, Wes Turner <wes.turner at gmail.com> wrote:

> https://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions
>
> [remainder of quote trimmed]
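For example, a minimal sketch of the scipy.stats distribution API, using
rayleigh from the link above; the sample data here is synthetic:

    from scipy import stats

    # Draw samples from a Rayleigh distribution, then recover its
    # location/scale parameters by maximum likelihood with .fit()
    samples = stats.rayleigh.rvs(loc=0, scale=2, size=1000)
    loc, scale = stats.rayleigh.fit(samples)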
From bob.haffner at gmail.com Wed Jan 4 08:43:43 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 4 Jan 2017 07:43:43 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To: <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

Yeah, no kidding. That pdf wasn't hard to find, and that #1 score is
pretty damn good.

On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha <omaha at python.org> wrote:

> Looks like we have our key to a score of 0.0. Lol
>
> [remainder of quote trimmed]
From luke.schollmeyer at gmail.com Wed Jan 4 08:57:27 2017
From: luke.schollmeyer at gmail.com (Luke Schollmeyer)
Date: Wed, 4 Jan 2017 07:57:27 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

I think there are two probable things:

1. We're likely using some under-powered ML methods. Most of the Kaggle
interviews of the top guys/teams I read are using much more advanced
methods to get their solutions into the top spots. I think what we're
doing is fine for what we want to accomplish.

2. Feature engineering. Again, many of the interviews show that a ton of
work goes into cleaning and conforming the data.

I haven't backtracked any of the interviews to their submissions, so I
don't know how often they tend to submit (e.g., tweaking a small aspect
and honing it until it pays off).

On Wed, Jan 4, 2017 at 7:43 AM, Bob Haffner via Omaha <omaha at python.org> wrote:

> Yeah, no kidding. That pdf wasn't hard to find, and that #1 score is
> pretty damn good.
>
> [remainder of quote trimmed]
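As one concrete (if still simple) version of the ensembling idea named
earlier in the thread, here is a minimal prediction-averaging sketch; the
data is a synthetic stand-in for the engineered features and log-transformed
SalePrice, and the two models are from the thread's own list:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Ridge

    # Synthetic stand-ins for the engineered train/test matrices
    X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
    X_train, X_test, y_train = X[:150], X[150:], y[:150]

    # Fit each model, then average the (log-space) predictions
    models = [Ridge(alpha=10.0), GradientBoostingRegressor(random_state=0)]
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    blended = np.mean(preds, axis=0)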
From bob.haffner at gmail.com Wed Jan 4 11:14:29 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Wed, 4 Jan 2017 10:14:29 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To:
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com>
	<63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com>
Message-ID:

Luke, agreed. I browsed through the forum discussions the other day and
was amazed at the techniques folks were using, especially around feature
engineering.

My comment was directed at the top score, which is extremely good and
looks to be head and shoulders above the rest.

On Wed, Jan 4, 2017 at 7:57 AM, Luke Schollmeyer via Omaha <omaha at python.org> wrote:

> I think there are two probable things:
>
> [remainder of quote trimmed]
I mean > > that > > > 0.038 score is so much better than anyone else it seems a little > > > unrealistic...or maybe it's just seems that way because I haven't been > > able > > > to break through 0.12 : ) > > > > > > > > > > > > > > > > > > Sent from my iPhone > > > > On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha > > > wrote: > > > > > > > > Pretty interesting notebook I put together regarding the kaggle comp > > > > https://github.com/bobhaffner/kaggle-houseprices/blob/ > > > master/additional_training_data.ipynb > > > > > > > > On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha < > > omaha at python.org> > > > > wrote: > > > > > > > >>> On Wednesday, December 28, 2016, Wes Turner > > > wrote: > > > >>> > > > >>> > > > >>> > > > >>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha < > > > >> omaha at python.org > > > >>> > wrote: > > > >>> > > > >>>> Woohoo! We jumped 286 positions with a meager 0.00448 improvement > in > > > our > > > >>>> score! Currently sitting at 798th place. > > > >>>> > > > >>> > > > >>> Nice work! Features of your feature engineering I admire: > > > >>> > > > >>> - nominal, ordinal, continuous, discrete > > > >>> categorical = nominal + discrete > > > >>> numeric = continuous + discrete > > > >>> > > > >>> - outlier removal > > > >>> - [ ] w/ constant thresholding? (is there a distribution > parameter) > > > >>> > > > >>> - building datestrings from SaleMonth and YrSold > > > >>> - SaleMonth / "1" / YrSold > > > >>> - df..drop(['MoSold','YrSold','SaleMonth']) > > > >>> - [ ] why drop SaleMonth? > > > >>> - [ ] pandas.to_datetime[df['SaleMonth']) > > > >>> > > > >>> - merging with FHA Home Price Index for the month and region ("West > > > North > > > >>> Central") > > > >>> https://www.fhfa.gov/DataTools/Downloads/Documents/ > > > >>> HPI/HPI_PO_monthly_hist.xls > > > >>> - [ ] pandas.to_datetime > > > >>> - this should have every month, but the new merge_asof feature > is > > > >>> worth mentioning > > > >>> > > > >>> - manual binarization > > > >>> - [ ] how did you pick these? correlation after pd.get_dummies? > > > >>> - [ ] why floats? 1.0 / 1 (does it make a difference?) 
> > > >>> > > > >>> - Ames, IA nbrhood_multiplier > > > >>> - http://www.cityofames.org/home/showdocument?id=1024 > > > >>> > > > >>> - feature merging > > > >>> - BsmtFinSF = BsmtFinSF1 + BsmtFinSF2 > > > >>> - TotalBaths = BsmtFullBath + (BsmtHalfBath / 2.0) + FullBath + > > > >>> (HalfBath / 2.0) > > > >>> - ( ) IDK how a feature-selection pipeline could do this > > automatically > > > >>> > > > >>> - null value imputation > > > >>> - .isnull() = 0 > > > >>> - ( ) datacleaner incorrectly sets these to median or mode > > > >>> > > > >>> - log for skewed continuous and SalePrice > > > >>> - ( ) auto_ml: take_log_of_y does this for SalePrice > > > >>> > > > >>> - "Keeping only the columns we want" > > > >>> - [ ] 'Id' shouldn't be relevant (pd.read_csv(filename, > > > index_col='Id') > > > >>> > > > >>> > > > >>> - Binarization > > > >>> - pd.get_dummies(dummy_na=False) > > > >>> - [ ] (a Luke pointed out, concatenation keeps the same columns) > > > >>> rows = eng_train.shape[0] > > > >>> eng_merged = pd.concat(eng_train, eng_test) > > > >>> onehot_merged = pd.get_dummies(eng_merged, columns=nominal, > > > >>> dummy_na=False) > > > >>> onehot_train = eng_merged[:rows] > > > >>> onehot_test = eng_merged[rows:] > > > >>> > > > >>> - class RandomSelectionHelper > > > >>> - [ ] this could be generally helpful in sklean[-pandas] > > > >>> - https://github.com/paulgb/sklearn-pandas#cross-validation > > > >>> > > > >>> - Models to Search > > > >>> - {Ridge, Lasso, ElasticNet} > > > >>> > > > >>> - https://github.com/ClimbsRocks/auto_ml/blob/ > > > >>> master/auto_ml/predictor.py#L222 > > > >>> _get_estimator_names ( "regressor" ) > > > >>> - {XGBRegessor, GradientBoostingRegressor, RANSACRegressor, > > > >>> RandomForestRegressor, LinearRegression, AdaBoostRegressor, > > > >>> ExtraTreesRegressor} > > > >>> > > > >>> - https://github.com/ClimbsRocks/auto_ml/blob/ > > > >>> master/auto_ml/predictor.py#L491 > > > >>> - (w/ ensembling) > > > >>> - ['RandomForestRegressor', 'LinearRegression', > > > >>> 'ExtraTreesRegressor', 'Ridge', 'GradientBoostingRegressor', > > > >>> 'AdaBoostRegressor', 'Lasso', 'ElasticNet', 'LassoLars', > > > >>> 'OrthogonalMatchingPursuit', 'BayesianRidge', 'SGDRegressor'] + [' > > > >>> XGBRegressor'] > > > >>> > > > >>> - model stacking / ensembling > > > >>> > > > >>> - ( ) auto_ml: https://auto-ml.readthedocs. > > > >> io/en/latest/ensembling.html > > > >>> - ( ) auto-sklearn: > > > >>> https://automl.github.io/auto-sklearn/stable/api.html# > > > >>> autosklearn.regression.AutoSklearnRegressor > > > >>> ensemble_size=50, ensemble_nbest=50 > > > >>> > > > >> > > > >> https://en.wikipedia.org/wiki/Ensemble_learning > > > >> > > > >> http://www.scholarpedia.org/article/Ensemble_learning# > > > >> Ensemble_combination_rules > > > >> > > > >> > > > >>> > > > >>> - submission['SalePrice'] = submission.SalePrice.apply(lambda x: > > > >>> np.exp(x)) > > > >>> > > > >>> - [ ] What is this called / how does this work? > > > >>> - https://docs.scipy.org/doc/numpy/reference/generated/ > > > >> numpy.exp.html > > > >>> > > > >>> - df.to_csv(filename, columns=['SalePrice'], index_label='Id') also > > > works > > > >>> - http://pandas.pydata.org/pandas-docs/stable/generated/ > > > >>> pandas.DataFrame.to_csv.html > > > >>> > > > >>> > > > >>> > > > >>>> My notebook is on GitHub for those interested: > > > >>>> > > > >>>> https://github.com/jeremy-doyle/home_price_kaggle/tree/ > > > master/attempt_4 > > > >>> > > > >>> > > > >>> Thanks! 
> > > >>> > > > >> > > > >> (Trimmed for 40K limit) > > > >> _______________________________________________ > > > >> Omaha Python Users Group mailing list > > > >> Omaha at python.org > > > >> https://mail.python.org/mailman/listinfo/omaha > > > >> http://www.OmahaPython.org > > > >> > > > > _______________________________________________ > > > > Omaha Python Users Group mailing list > > > > Omaha at python.org > > > > https://mail.python.org/mailman/listinfo/omaha > > > > http://www.OmahaPython.org > > > > > > _______________________________________________ > > > Omaha Python Users Group mailing list > > > Omaha at python.org > > > https://mail.python.org/mailman/listinfo/omaha > > > http://www.OmahaPython.org > > > > > _______________________________________________ > > Omaha Python Users Group mailing list > > Omaha at python.org > > https://mail.python.org/mailman/listinfo/omaha > > http://www.OmahaPython.org > > > _______________________________________________ > Omaha Python Users Group mailing list > Omaha at python.org > https://mail.python.org/mailman/listinfo/omaha > http://www.OmahaPython.org > From travis42 at gmail.com Wed Jan 4 09:50:02 2017 From: travis42 at gmail.com (Travis Smith) Date: Wed, 4 Jan 2017 08:50:02 -0600 Subject: [omaha] Group Data Science Competition In-Reply-To: References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com> Message-ID: <12F96626-6D10-4B23-87F8-8980C5069E57@gmail.com> Hey, new guy here. What's the challenge, exactly? I'm not a Kaggler yet, but I have taken some data science courses. -Travis > On Jan 4, 2017, at 7:57, Luke Schollmeyer via Omaha wrote: > > I think there's two probable things: > 1. We're likely using some under-powered ML methods. Most of the Kaggle > interviews of the top guys/teams I read are using some much more advanced > methods to get their solutions into the top spots. I think what we're doing > is fine for what we want to accomplish. > 2. Feature engineering. Again, many of the interviews show that a ton of > work goes in to cleaning and conforming the data. > > I haven't back tracked any of the interviews to their submissions, so I > don't know how often they tend to submit, like tweak a small aspect and > keep honing that until it pays off. > > On Wed, Jan 4, 2017 at 7:43 AM, Bob Haffner via Omaha > wrote: > >> Yeah, no kidding. That pdf wasn't hard to find and that #1 score is pretty >> damn good >> >> On Tue, Jan 3, 2017 at 10:41 PM, Jeremy Doyle via Omaha >> wrote: >> >>> Looks like we have our key to a score of 0.0. Lol >>> >>> Seriously though, does anyone wonder if the person sitting at #1 had this >>> full data set as well and trained a model using the entire set? I mean >> that >>> 0.038 score is so much better than anyone else it seems a little >>> unrealistic...or maybe it's just seems that way because I haven't been >> able >>> to break through 0.12 : ) >>> >>> >>> >>> >>> >>> Sent from my iPhone >>>>> On Jan 3, 2017, at 7:51 PM, Bob Haffner via Omaha >>>> wrote: >>>> >>>> Pretty interesting notebook I put together regarding the kaggle comp >>>> https://github.com/bobhaffner/kaggle-houseprices/blob/ >>> master/additional_training_data.ipynb >>>> >>>> On Mon, Jan 2, 2017 at 12:10 AM, Wes Turner via Omaha < >> omaha at python.org> >>>> wrote: >>>> >>>>>> On Wednesday, December 28, 2016, Wes Turner >>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Dec 28, 2016 at 12:56 AM, Jeremy Doyle via Omaha < >>>>> omaha at python.org >>>>>> > wrote: >>>>>> >>>>>>> Woohoo! 
From bob.haffner at gmail.com  Thu Jan  5 12:20:15 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Thu, 5 Jan 2017 11:20:15 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To: <12F96626-6D10-4B23-87F8-8980C5069E57@gmail.com>
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com> <12F96626-6D10-4B23-87F8-8980C5069E57@gmail.com>
Message-ID: 

Hi Travis,

A few of us are doing the House Prices: Advanced Regression Techniques competition
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Our team is called Omaha Pythonistas. You are more than welcome to join us! Just let me know which email you use to sign up with on Kaggle and I'll send out an invite.

We met in December and we hope to meet again soon, most likely following our monthly meeting on 1/18.

Some of our materials:
https://github.com/omahapython/kaggle-houseprices
https://github.com/jeremy-doyle/home_price_kaggle
https://github.com/bobhaffner/kaggle-houseprices

On Wed, Jan 4, 2017 at 8:50 AM, Travis Smith via Omaha wrote:

> Hey, new guy here. What's the challenge, exactly? I'm not a Kaggler yet,
> but I have taken some data science courses.
>
> -Travis
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
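[For anyone joining the team, a bare-bones end-to-end submission sketch for this competition. The file names match the Kaggle downloads; the constant-median "model" is a deliberately trivial placeholder.]

    import pandas as pd

    train = pd.read_csv('train.csv', index_col='Id')
    test = pd.read_csv('test.csv', index_col='Id')

    # trivial baseline: predict the median sale price for every house
    submission = pd.DataFrame(index=test.index)
    submission['SalePrice'] = train['SalePrice'].median()
    submission.to_csv('submission.csv', index_label='Id')

Swapping a real model's predictions into submission['SalePrice'] is all that changes between attempts; the to_csv call produces the Id,SalePrice format Kaggle expects.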
From wereapwhatwesow at gmail.com  Mon Jan  9 16:52:51 2017
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Mon, 9 Jan 2017 15:52:51 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
Message-ID: 

I received this request from Becky (who is presenting at next week's meeting):

"Dr. Betty Love at UNO would like to invite her two Graduate classes, Computational OR and Network Programming. There would be about 8 students if they all showed (which I doubt all would show). Would that be okay?
Becky Brusky
(I'm presenting the Sudoku Python work we did in Betty's Integer Programming class)"

I would like to accommodate her request, but we have the large conference room booked at DoSpace, and they restrict it to 10 people. I think we would be over the limit if we had even 4 extra people attend.

The large meeting room at Alley Poyner is available that evening.

Any other ideas?

http://www.omahapython.org/blog/archives/event/january-meeting-linear-programming-with-python-and-gurobi?instance_id=32

Steve
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From bob.haffner at gmail.com  Mon Jan  9 21:06:04 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Mon, 9 Jan 2017 20:06:04 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

The Alley Poyner room would be fine by me.

On Mon, Jan 9, 2017 at 3:52 PM, Steve Young via Omaha wrote:

> I received this request from Becky (who is presenting at next week's
> meeting):
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From wereapwhatwesow at gmail.com  Tue Jan 10 16:06:18 2017
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Tue, 10 Jan 2017 15:06:18 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
Message-ID: 

The meeting will be held at Alley Poyner, in the large conference room. It can hold 30+ people, so invite as many people as you would like.

Date - Wed, 1/18 (a week from tomorrow)
Topic - January Meeting - Linear Programming with Python and Gurobi
Presenter - Becky Brusky
Location - 1516 Cuming Street, Alley Poyner Macchietto Architecture
http://www.omahapython.org/blog/archives/event/january-meeting-linear-programming-with-python-and-gurobi

The Kaggle competition group may meet after the presentation.

Should be a great meeting - see you there.

Steve

On Mon, Jan 9, 2017 at 8:06 PM, Bob Haffner wrote:

> The Alley Poyner room would be fine by me.
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
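[For anyone who wants to warm up for the talk, a minimal linear-programming sketch with gurobipy. It assumes a licensed Gurobi install, and the toy objective and constraints are made up for illustration; PuLP exposes a very similar open-source API.]

    import gurobipy as gp
    from gurobipy import GRB

    m = gp.Model('toy_lp')
    x = m.addVar(lb=0, name='x')   # nonnegative decision variables
    y = m.addVar(lb=0, name='y')

    m.setObjective(3 * x + 2 * y, GRB.MAXIMIZE)
    m.addConstr(x + y <= 4, name='capacity')
    m.addConstr(x + 3 * y <= 6, name='budget')

    m.optimize()
    print('x=%.2f y=%.2f objective=%.2f' % (x.X, y.X, m.ObjVal))

Integer programs like the Sudoku example Becky mentions follow the same pattern, with vtype=GRB.BINARY variables in place of continuous ones.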
From Becky_Brusky at unigroup.com  Tue Jan 10 16:07:37 2017
From: Becky_Brusky at unigroup.com (Brusky, Becky)
Date: Tue, 10 Jan 2017 21:07:37 +0000
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

Thanks Steve. I'll send the invite on.

From: Steve Young
Date: Tuesday, January 10, 2017 at 3:06 PM
To: Bob Haffner, "Brusky, Becky"
Cc: Omaha Python Users Group
Subject: Re: [omaha] Meeting Location for larger group - next week's meeting

(Quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From bob.haffner at gmail.com  Tue Jan 10 18:48:12 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Tue, 10 Jan 2017 17:48:12 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

Thanks for hosting, Steve!

Kagglers, let's try to meet following Becky's presentation.

On Tue, Jan 10, 2017 at 3:06 PM, Steve Young wrote:

> The meeting will be held at Alley Poyner, in the large conference room.
> It can hold 30+ people, so invite as many people as you would like.
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From hubert.hickman at gmail.com  Tue Jan 10 21:43:10 2017
From: hubert.hickman at gmail.com (Hubert Hickman)
Date: Tue, 10 Jan 2017 20:43:10 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

The meeting is on Tech Omaha....

On Tue, Jan 10, 2017 at 5:48 PM, Bob Haffner via Omaha wrote:

> Thanks for hosting, Steve!
>
> Kagglers, let's try to meet following Becky's presentation.
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From bob.haffner at gmail.com  Thu Jan 12 10:15:01 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Thu, 12 Jan 2017 09:15:01 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

Hi all,

Well, shoot. I won't be able to attend Wednesday's meeting. I was really looking forward to Becky's presentation.

Bob

On Tue, Jan 10, 2017 at 8:43 PM, Hubert Hickman via Omaha wrote:

> The meeting is on Tech Omaha....
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From bob.haffner at gmail.com  Fri Jan 13 23:31:32 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Fri, 13 Jan 2017 22:31:32 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To: 
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com> <12F96626-6D10-4B23-87F8-8980C5069E57@gmail.com>
Message-ID: 

Look at that. Two teams have submitted perfect scores :-)

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard

On Thu, Jan 5, 2017 at 11:20 AM, Bob Haffner wrote:

> Hi Travis,
>
> A few of us are doing the House Prices: Advanced Regression Techniques
> competition
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
From wes.turner at gmail.com  Sat Jan 14 01:52:04 2017
From: wes.turner at gmail.com (Wes Turner)
Date: Sat, 14 Jan 2017 00:52:04 -0600
Subject: [omaha] Group Data Science Competition
In-Reply-To: 
References: <98FDF8B2-6371-4C4A-BA84-DD18AA7DC3A0@gmail.com> <63E88FA3-5AB5-4F8D-A610-2FE27F2AB772@yahoo.com> <12F96626-6D10-4B23-87F8-8980C5069E57@gmail.com>
Message-ID: 

On Friday, January 13, 2017, Bob Haffner via Omaha wrote:

> Look at that. Two teams have submitted perfect scores :-)
>
> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/rules

- Due to the public nature of the data, this competition does not count towards Kaggle ranking points.
- We ask that you respect the spirit of the competition and do not cheat. Hand-labeling is forbidden.

https://www.kaggle.com/wiki/ModelSubmissionBestPractices
https://www.kaggle.com/wiki/WinningModelDocumentationTemplate (CNN, XGBoost)

Hopefully I can find some time to fix the data loading function in my data.py and test w/ TPOT (manual sparse arrays) and auto_ml.

- https://www.coursera.org/learn/ml-foundations/lecture/2HrHv/learning-a-simple-regression-model-to-predict-house-prices-from-house-size (UW)
- "Python Data Science Handbook": "This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks."
  https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/README.md#5-machine-learning (~UW)

I'd also like to learn how to build neural networks with Keras (on Theano or TensorFlow):

https://github.com/fchollet/keras
- https://keras.io/getting-started/faq/#how-can-i-record-the-training-validation-loss-accuracy-at-each-epoch
- http://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/

> On Thu, Jan 5, 2017 at 11:20 AM, Bob Haffner wrote:
>
> > Hi Travis,
> >
> > (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org
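[Picking up the Keras links above, a minimal regression sketch in the spirit of that machinelearningmastery tutorial. The file name, layer sizes, and training settings are illustrative assumptions; argument names follow Keras 2, which renamed Keras 1.x's nb_epoch to epochs.]

    import numpy as np
    import pandas as pd
    from keras.models import Sequential
    from keras.layers import Dense

    train = pd.read_csv('train.csv', index_col='Id')       # assumed Kaggle training file
    y = np.log1p(train.pop('SalePrice')).values            # log target, as elsewhere in the thread
    X = pd.get_dummies(train).fillna(0).values.astype('float32')
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)      # standardize features

    model = Sequential()
    model.add(Dense(64, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1))                                    # linear output for regression
    model.compile(loss='mse', optimizer='adam')

    history = model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2, verbose=0)
    print('final validation MSE (log scale): %.4f' % history.history['val_loss'][-1])

The history.history dict is the per-epoch loss record that the Keras FAQ link above describes.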
From Becky_Brusky at unigroup.com  Tue Jan 17 09:09:44 2017
From: Becky_Brusky at unigroup.com (Brusky, Becky)
Date: Tue, 17 Jan 2017 14:09:44 +0000
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

Steve,

Do you know if I need to bring any connectors for video display? My demo is on a MacBook.

Thanks,
Becky

On 1/10/17, 8:43 PM, "Omaha on behalf of Hubert Hickman via Omaha" wrote:

> The meeting is on Tech Omaha....
>
> (Remaining quoted text trimmed)
_______________________________________________
Omaha Python Users Group mailing list
Omaha at python.org
https://mail.python.org/mailman/listinfo/omaha
http://www.OmahaPython.org

From wereapwhatwesow at gmail.com  Tue Jan 17 09:23:18 2017
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Tue, 17 Jan 2017 08:23:18 -0600
Subject: [omaha] Meeting Location for larger group - next week's meeting
In-Reply-To: 
References: 
Message-ID: 

We have HDMI, mini-HDMI, and Mini DisplayPort connections. I think one of them will work with most MacBook models.

On Tue, Jan 17, 2017 at 8:09 AM, Brusky, Becky via Omaha wrote:

> Steve,
>
> Do you know if I need to bring any connectors for video display? My demo
> is on a MacBook.
>
> (Remaining quoted text trimmed)
> >> >>> "Dr. Betty Love at UNO would like to invite her two Graduate
> >> >>> classes, Computational OR and Network Programming. There would be
> >> >>> about 8 students if they all showed (which I doubt all would
> >> >>> show). Would that be okay?
> >> >>> Becky Brusky
> >> >>> (I'm presenting the Sudoku program in Python that we worked on in
> >> >>> Betty's Integer Programming class)"
> >> >>>
> >> >>> I would like to accommodate her request, but we have the large
> >> >>> conference room booked at DoSpace, and they restrict it to 10
> >> >>> people. I think we would be over the limit if we had even 4 extra
> >> >>> people attend.
> >> >>>
> >> >>> The large meeting room at Alley Poyner is available that evening.
> >> >>>
> >> >>> Any other ideas?
> >> >>>
> >> >>> http://www.omahapython.org/blog/archives/event/january-meeting-linear-programming-with-python-and-gurobi?instance_id=32
> >> >>>
> >> >>> Steve

From wereapwhatwesow at gmail.com  Tue Jan 17 16:45:35 2017
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Tue, 17 Jan 2017 15:45:35 -0600
Subject: [omaha] Meeting tomorrow @ 6:30pm
Message-ID:

Topic:  Becky is presenting on Linear Programming with Python and Gurobi.
Kaggle competition meetup at end of regular meeting.

Time:  6:30-8pm

Location:  Alley Poyner Macchietto Architecture, 1516 Cuming Street, Omaha

http://www.omahapython.org/blog/archives/event/january-meeting-linear-programming-with-python-and-gurobi?instance_id=32
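If you'd like to tinker before the talk, a linear program in Gurobi's
Python API is only a few lines. Here is a minimal, untested sketch (it
assumes gurobipy is installed and licensed; the toy numbers are made up
for illustration):

    import gurobipy as gp
    from gurobipy import GRB

    m = gp.Model("toy")
    x = m.addVar(lb=0, name="x")  # units of product A
    y = m.addVar(lb=0, name="y")  # units of product B

    m.setObjective(3 * x + 2 * y, GRB.MAXIMIZE)   # maximize profit
    m.addConstr(x + y <= 4, name="labor")         # shared labor budget
    m.addConstr(x + 3 * y <= 6, name="material")  # shared material budget

    m.optimize()
    print(x.X, y.X, m.ObjVal)  # optimal plan and its profit

Becky will show how to read a model and its results in much more depth.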
Should be a great meeting.  See you there.

Steve

From wereapwhatwesow at gmail.com  Mon Jan 23 18:07:22 2017
From: wereapwhatwesow at gmail.com (Steve Young)
Date: Mon, 23 Jan 2017 17:07:22 -0600
Subject: [omaha] Winter/Spring 2017 meeting topics and speakers
Message-ID:

We had a good meeting last week - thanks Becky and the others who
attended. I am amazed at how many Python libraries are available that
simplify complex programming tasks.

February is scheduled - Hubert Hickman, Victor Winter, and Betty Love,
presenting on Building the Bricklayer IDE
<...the-bricklayer-ide?instance_id=35>, at the DoSpace, Meeting Room 1.

I would love to start getting some topics and presenters scheduled for
the next few months.

Now is your chance to have your time in the spotlight or request a topic
for someone else to present. Bob H has offered to present on Flask, but
we have not picked a date for it yet.

March
April
May
June

Just reply to the thread with your ideas. Thanks.

Steve

From bob.haffner at gmail.com  Sun Jan 29 20:14:20 2017
From: bob.haffner at gmail.com (Bob Haffner)
Date: Sun, 29 Jan 2017 19:14:20 -0600
Subject: [omaha] Winter/Spring 2017 meeting topics and speakers
In-Reply-To:
References:
Message-ID:

Steve, I'm still game to do the Microservices with Flask talk. I can take
the March slot if no one else does.
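To give everyone an idea of the scale involved: a Flask microservice can
fit on one slide. A minimal, untested sketch (it assumes Flask is
installed; the /health endpoint is just an example, not the talk's actual
code):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # a tiny JSON endpoint - the building block of a microservice
        return jsonify(status="ok")

    if __name__ == "__main__":
        app.run(port=5000)

More at the talk on how these small pieces fit together.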
All, please consider giving a talk or perhaps leading a discussion.
Doesn't have to be lengthy or elaborate. We're an easygoing bunch :-)

Bob

On Mon, Jan 23, 2017 at 5:07 PM, Steve Young via Omaha wrote:

> We had a good meeting last week - thanks Becky and the others who
> attended. I am amazed at how many Python libraries are available that
> simplify complex programming tasks.
>
> February is scheduled - Hubert Hickman, Victor Winter, and Betty Love,
> presenting on Building the Bricklayer IDE
> <...the-bricklayer-ide?instance_id=35>, at the DoSpace, Meeting Room 1.
>
> I would love to start getting some topics and presenters scheduled for
> the next few months.
>
> Now is your chance to have your time in the spotlight or request a topic
> for someone else to present. Bob H has offered to present on Flask, but
> we have not picked a date for it yet.
>
> March
> April
> May
> June
>
> Just reply to the thread with your ideas. Thanks.
>
> Steve
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> https://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>