[omaha] Group Data Science Competition

Sun Dec 18 11:59:49 EST 2016

Thanks, Bob!

On Sun, Dec 18, 2016 at 9:26 AM, Bob Haffner <bob.haffner at gmail.com> wrote:

> Nice job, Wes!!
>
> On Sun, Dec 18, 2016 at 4:11 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
>> In addition to posting to the mailing list, I created a comment on the
>> "Kaggle Submissions" issue [1]:
>>
>> - Score: 0.13667 (#1370)
>>>   - https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard?submissionId=3925119
>>>   - https://mail.python.org/pipermail/omaha/2016-December/002206.html
>>>   - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>
>>
>> [1] https://github.com/omahapython/kaggle-houseprices/issues/2
>>
>> On Sun, Dec 18, 2016 at 3:45 AM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>>> Sounds great. 1/18.
>>>
>>> I just submitted my first submission.csv to Kaggle! [1]
>>>
>>> $ python ./tpot_house_prices__001__modified.py
>>> class_sum: 264144946
>>> abs error: 5582809.288
>>> % error:   2.11354007432 %
>>> error**2:  252508654837.0
>>> #  python ./tpot_house_prices__001__modified.py
>>>
>>>
>>> ... Which moves us up to #1370!
>>>
>>> Your Best Entry ↑
>>> You improved on your best score by 0.02469.
>>> You just moved up 608 positions on the leaderboard.
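
For reference, the figures printed by the script above are consistent with simple aggregate error metrics: 5582809.288 / 264144946 is about 2.1135 %. Below is a minimal sketch of how such numbers could be computed. The function names and the exact definitions are assumptions, since the script is only linked here, not quoted. The leaderboard score (0.13667) is a different metric: Kaggle scores this competition by the RMSE between the log of the predicted price and the log of the observed price, which the second function sketches.

```python
import math


def summarize_errors(actual, predicted):
    """Aggregate error figures for two equal-length sequences of prices.

    Hypothetical helper: the definitions are inferred from the printed
    output, not taken from the linked script.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    class_sum = sum(actual)
    abs_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    squared_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return {
        "class_sum": class_sum,
        "abs error": abs_error,
        "% error": 100.0 * abs_error / class_sum,
        "error**2": squared_error,
    }


def rmsle(actual, predicted):
    """RMSE between log(actual) and log(predicted); prices must be > 0."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    )
    return math.sqrt(total / len(actual))
```

Because the log-based score penalizes relative rather than absolute error, a low "% error" on the training data does not translate directly into a leaderboard position.
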
>>>
>>>
>>> I have a few more things to try:
>>>
>>>
>>>    - Manually drop the 'Id' column
>>>    - do_get_dummies=True (data.py) + EC2 m4.4xlarge instance
>>>       - I got an OOM error w/ an 8GB notebook (at 25/120 w/ verbosity=2)
>>>       - https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/data.py#L94
>>>       - sklearn GridSearchCV and/or sklearn-deap the TPOT hyperparameters
>>>       - http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>       - https://github.com/rsteca/sklearn-deap
>>>    - REF,BLD,DOC,TST:
>>>       - factor constants out in favor of settings.json and data.py
>>>          - https://github.com/omahapython/kaggle-houseprices/blob/master/src/data.py
>>>       - implement train.py and predict.py, too
>>>       - create a Dockerfile FROM kaggle/docker-python:latest
>>>          - https://github.com/omahapython/datascience/issues/3 "Kaggle
>>>          Best Practices"
>>>       - docstrings, tests
>>>    - https://github.com/omahapython/datascience/wiki/resources
>>>
>>> [1] https://github.com/westurner/house_prices/blob/2839ff8a/house_prices/pipelines/tpot_house_prices__001__modified.py
>>>
>>> On Sat, Dec 17, 2016 at 4:39 PM, Bob Haffner via Omaha <omaha at python.org> wrote:
>>>
>>>> Hey all, regarding the January Kaggle meetup that we talked about: maybe
>>>> we can meet following our regular monthly meeting (1/18).
>>>>
>>>> Would that be easier/better for everyone?
>>>>
>>>> On Sat, Dec 17, 2016 at 4:34 PM, Bob Haffner <bob.haffner at gmail.com>
>>>> wrote:
>>>>
>>>> > Just submitted another Linear Regression attempt (0.16136).  Added some
>>>> > features, both numeric and categorical, and created 3 numerics:
>>>> >
>>>> > -TotalFullBaths
>>>> > -TotalHalfBaths
>>>> > -Pool
>>>> >
>>>> > Notebook attached
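
The attached notebook is not reproduced in the archive, so here is a rough sketch of how three such numeric features might be derived with pandas. The source columns (FullBath, BsmtFullBath, HalfBath, BsmtHalfBath, PoolArea) come from the competition's data dictionary, but the formulas and the function name are assumptions, not necessarily what the notebook does.

```python
import pandas as pd


def add_bath_and_pool_features(df):
    """Return a copy of df with three derived numeric columns.

    Assumed definitions (hypothetical, not taken from the notebook):
      TotalFullBaths = FullBath + BsmtFullBath
      TotalHalfBaths = HalfBath + BsmtHalfBath
      Pool           = 1 if PoolArea > 0 else 0
    """
    out = df.copy()
    # Basement bath counts can be missing; treat a missing count as zero.
    out["TotalFullBaths"] = out["FullBath"] + out["BsmtFullBath"].fillna(0)
    out["TotalHalfBaths"] = out["HalfBath"] + out["BsmtHalfBath"].fillna(0)
    out["Pool"] = (out["PoolArea"] > 0).astype(int)
    return out
```

Applying the same function to both train.csv and test.csv keeps the two feature sets aligned before fitting the regression.
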
>>>> >
>>>> >
>>>> >
>>>> > On Sat, Dec 17, 2016 at 4:21 PM, Bob Haffner <bob.haffner at gmail.com>
>>>> > wrote:
>>>> >
>>>> >> Just submitted another Linear Regression attempt (0.16136).  Added some
>>>> >> features, both numeric and categorical, and created 3 numerics:
>>>> >>
>>>> >> -TotalFullBaths
>>>> >> -TotalHalfBaths
>>>> >> -Pool
>>>> >>
>>>> >> Notebook attached
>>>> >>
>>>> >> On Sat, Dec 17, 2016 at 3:28 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> On Sat, Dec 17, 2016 at 3:25 PM, Wes Turner <wes.turner at gmail.com>
>>>> >>> wrote:
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Sat, Dec 17, 2016 at 2:39 PM, Bob Haffner via Omaha <
>>>> >>>> omaha at python.org> wrote:
>>>> >>>>
>>>> >>>>> >Does Kaggle take the high mark but still give a score for each
>>>> >>>>> >submission?
>>>> >>>>> Yes.
>>>> >>>>> https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> >Thinking of ways to keep track of which code produced which score;
>>>> >>>>> >I'll post about the GitHub setup in a bit.
>>>> >>>>> We could push our notebooks to the GitHub repo?  Maybe include a brief
>>>> >>>>> description at the top in a markdown cell.
>>>> >>>>>
>>>> >>>>
>>>> >>>> In my research [1], I found that the preferred folder structure for
>>>> >>>> Kaggle is input/ (data), src/ (.py, .ipynb notebooks), and working/
>>>> >>>> (outputs); and that they recommend creating a settings.json with path
>>>> >>>> configuration (e.g. pointing to input/, src/, data/).
>>>> >>>>
>>>> >>>> So, we could put notebooks, folders, and repos in src/ [2].
>>>> >>>>
>>>> >>>> runipy is a bit more scriptable than requiring notebook gui
>>>> >>>> interactions [3].
>>>> >>>>
>>>> >>>> We could either hardcode '../input/test.csv' in our .py and .ipynb
>>>> >>>> sources, or we could write a function in src/data.py to read
>>>> >>>> '../settings.json' into a dict with the recommended variable names [1]:
>>>> >>>>
>>>> >>>>     import pandas as pd
>>>> >>>>     from data import read_settings_json
>>>> >>>>
>>>> >>>>     settings = read_settings_json()
>>>> >>>>     train = pd.read_csv(settings['TRAIN_DATA_PATH'])
>>>> >>>>     # ....
>>>> >>>>     # `submission` is a placeholder name for the predictions DataFrame
>>>> >>>>     submission.to_csv(settings['SUBMISSION_PATH'], index=False)
>>>> >>>>
>>>> >>>> [1] https://github.com/omahapython/datascience/issues/3#issuecomment-267236556
>>>> >>>> [2] https://github.com/omahapython/kaggle-houseprices/tree/master/src
>>>> >>>> [3] https://pypi.python.org/pypi/runipy
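
A minimal sketch of what the read_settings_json helper proposed above might look like, using only the standard library. The default path and the choice to resolve relative paths against the directory that holds settings.json are assumptions made for illustration; the key names TRAIN_DATA_PATH and SUBMISSION_PATH are the ones used in the snippet above.

```python
import json
import os


def read_settings_json(path="../settings.json"):
    """Load settings.json and return its contents as a dict.

    Assumption (not from the thread): string values whose key ends in
    "_PATH" are resolved relative to the directory containing the
    settings file, so scripts work from any working directory.
    """
    with open(path, encoding="utf-8") as f:
        settings = json.load(f)

    base_dir = os.path.dirname(os.path.abspath(path))
    resolved = {}
    for key, value in settings.items():
        if key.endswith("_PATH") and isinstance(value, str):
            # Absolute paths pass through os.path.join unchanged.
            resolved[key] = os.path.normpath(os.path.join(base_dir, value))
        else:
            resolved[key] = value
    return resolved
```

With a settings.json such as {"TRAIN_DATA_PATH": "input/train.csv", "SUBMISSION_PATH": "working/submission.csv"} in the repository root, every script and notebook under src/ can call read_settings_json() instead of hardcoding '../input/...' paths.
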
>>>> >>>>
>>>> >>>>
>>>> >>>>>
>>>> >>>>> I initially thought GitHub was a good way to go, but I don't know if
>>>> >>>>> everyone has a GitHub acct or is interested in starting one.  Maybe
>>>> >>>>> email is the way to go?
>>>> >>>>>
>>>> >>>>
>>>> >>>> I'm all for GitHub:
>>>> >>>>
>>>> >>>> - git source control and revision numbers
>>>> >>>> - we're not able to easily share code in the mailing list
>>>> >>>> - we can learn from each other's solutions
>>>> >>>>
>>>> >>>
>>>> >>> An example of mailing list limitations:
>>>> >>>
>>>> >>>
>>>> >>> Your mail to 'Omaha' with the subject
>>>> >>>
>>>> >>>     Re: [omaha] Group Data Science Competition
>>>> >>>
>>>> >>> Is being held until the list moderator can review it for approval.
>>>> >>>
>>>> >>> The reason it is being held:
>>>> >>>
>>>> >>>     Message body is too big: 47004 bytes with a limit of 40 KB
>>>> >>>
>>>> >>>  (I trimmed out the reply chain; so this may make it through first)
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>> _______________________________________________
>>>> Omaha Python Users Group mailing list
>>>> Omaha at python.org
>>>> https://mail.python.org/mailman/listinfo/omaha
>>>> http://www.OmahaPython.org
>>>>
>>>
>>>
>>
>