[SciPy-User] R vs Python for simple interactive data analysis

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Aug 29 11:42:32 EDT 2011


On Mon, Aug 29, 2011 at 11:34 AM, Christopher Jordan-Squire
<cjordan1 at uw.edu> wrote:
> On Mon, Aug 29, 2011 at 10:27 AM,  <josef.pktd at gmail.com> wrote:
>> On Mon, Aug 29, 2011 at 11:10 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
>>> <cjordan1 at uw.edu> wrote:
>>>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
>>>>>>> <jason-sage at creativetrax.com> wrote:
>>>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
>>>>>>>>> This comparison might be useful to some people, so I stuck it up on a
>>>>>>>>> github repo. My overall impression is that R is much stronger for
>>>>>>>>> interactive data analysis. Click on the link for more details why,
>>>>>>>>> which are summarized in the README file.
>>>>>>>>
>>>>>>>> From the README:
>>>>>>>>
>>>>>>>> "In fact, using Python without the IPython qtconsole is practically
>>>>>>>> impossible for this sort of cut and paste, interactive analysis.
>>>>>>>> The shell IPython doesn't allow it because it automatically adds
>>>>>>>> whitespace on multiline bits of code, breaking pre-formatted code's
>>>>>>>> alignment. Cutting and pasting works for the standard python shell,
>>>>>>>> but then you lose all the advantages of IPython."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You might use %cpaste in the ipython normal shell to paste without it
>>>>>>>> automatically inserting spaces:
>>>>>>>>
>>>>>>>> In [5]: %cpaste
>>>>>>>> Pasting code; enter '--' alone on the line to stop.
>>>>>>>> :if 1>0:
>>>>>>>> :    print 'hi'
>>>>>>>> :--
>>>>>>>> hi
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jason
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> SciPy-User mailing list
>>>>>>>> SciPy-User at scipy.org
>>>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>>>>
>>>>>>>
>>>>>>> This strikes me as a textbook example of why we need an integrated
>>>>>>> formula framework in statsmodels. I'll make a pass through when I get
>>>>>>> a chance and see if there are some places where pandas would really
>>>>>>> help out.
>>>>>>
>>>>>> We used to have a formula class in scipy.stats and I do not follow
>>>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it also
>>>>>> had this (extremely flexible but very hard to comprehend). It was what
>>>>>> I had argued was needed ages ago for statsmodels. But it needs a
>>>>>> community effort because the syntax required serves multiple
>>>>>> communities with different annotations and needs. That is also seen
>>>>>> from the different approaches taken by the stats packages from S/R,
>>>>>> SAS, Genstat (and those are just the ones I have used).
>>>>>>
>>>>>
>>>>> We have held this discussion at _great_ length multiple times on the
>>>>> statsmodels list and are in the process of trying to integrate
>>>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy) into
>>>>> the statsmodels base.
>>>>>
>>>>> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
>>>>>
>>>>> and more recently
>>>>>
>>>>> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931?
>>>>>
>>>>> https://github.com/statsmodels/formula
>>>>> https://github.com/statsmodels/charlton
>>>>>
>>>>> Wes and I made some effort to go through this at SciPy. From where I
>>>>> sit, I think it's difficult to disentangle the data structures from
>>>>> the formula implementation, or maybe I'd just prefer to finish
>>>>> tackling the former because it's much more straightforward. So I'd
>>>>> like to first finish the pandas-integration branch that we've started
>>>>> and then focus on the formula support. This is on my (our, I hope...)
>>>>> immediate long-term goal list. Then I'd like to come back to the
>>>>> community and hash out the 'rules of the game' details for formulas
>>>>> after we have some code for people to play with, which promises to be
>>>>> "fun."
>>>>>
>>>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
>>>>>
>>>>> FWIW, I could also improve the categorical function to be much nicer
>>>>> for the given examples (i.e., take a list, drop a reference category),
>>>>> but I don't know that it's worth it, because it's really just a
>>>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts on
>>>>> more stop-gap?
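
A minimal sketch of what such a stop-gap could look like, in plain NumPy. The
name dummy_code and its drop_reference flag are made up for illustration; this
is not the statsmodels categorical function, just the "take a list, drop a
reference category" idea, with the first sorted level assumed to be the
reference:

```python
import numpy as np


def dummy_code(labels, drop_reference=True):
    """Return (design, kept_levels) with one 0/1 column per kept level.

    Hypothetical helper: the first level in sorted order is treated as the
    reference category and dropped when drop_reference is True.
    """
    labels = np.asarray(labels)
    # levels: sorted unique labels; codes: index of each label in levels
    levels, codes = np.unique(labels, return_inverse=True)
    # Compare each code against every level index to get indicator columns.
    design = (codes[:, None] == np.arange(len(levels))).astype(float)
    if drop_reference:
        design, levels = design[:, 1:], levels[1:]
    return design, levels


design, kept = dummy_code(["a", "b", "c", "a", "b"])
```

Here design has one column each for 'b' and 'c', and the rows labelled 'a' are
all zeros, so 'a' is absorbed into the intercept of whatever model uses it.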
>>>>>
>>>>
>>>> I want more usability, but I agree that a stop-gap probably isn't the
>>>> right way to go, unless it has things we'd eventually want anyways.
>>>>
>>>>> If I understand Chris' concerns, I think pandas + formula will go a
>>>>> long way towards bridging the gap between Python and R usability, but
>>>>
>>>> Yes, I agree. pandas + formulas would go a long, long way towards more
>>>> usability.
>>>>
>>>> Though I really, really want a scatterplot smoother (i.e., lowess) in
>>>> statsmodels. I use it a lot, and the final part of my R file was
>>>> entirely lowess. (And, I should add, that was the part people liked
>>>> best since one of the main goals of the assignment was to generate
>>>> nifty pictures that could be used to summarize the data.)
>>>>
>>>
>>> Working my way through the pull requests. Very time poor...
>>>
>>>>> it's a large effort and there are only a handful (at best) of people
>>>>> writing code -- Wes being the only one who's more or less "full time"
>>>>> as far as I can tell. The 0.4 statsmodels release should be very
>>>>> exciting though, I hope. I'm looking forward to it, at least. Then
>>>>> there's only the small problem of building an infrastructure and
>>>>> community like CRAN so we can have specialists writing and maintaining
>>>>> code...but I hope once all the tools are in place this will seem much
>>>>> less daunting. There certainly seems to be the right sentiment for it.
>>>>>
>>>>
>>>> At the very least creating and testing models would be much simpler.
>>>> For weeks I've been wanting to see if gmm is the same as gee by
>>>> fitting both models to the same dataset, but I've been putting it off
>>>> because I didn't want to construct the design matrices by hand for
>>>> such a simple question. (GMM--Generalized Method of Moments--is a
>>>> standard econometrics model and GEE--Generalized Estimating
>>>> Equations--is a standard biostatistics model. They're both
>>>> generalizations of quasi-likelihood and appear very similar, but I
>>>> want to fit some models to figure out if they're exactly the same.)
>>
>> Since GMM is still in the sandbox, the interface is not very polished,
>> and it's missing some enhancements. I recommend asking on the mailing
>> list if it's not clear.
>>
>> Note GMM itself is very general and will never be a quick interactive
>> method. The main work will always be to define the moment conditions
>> (a bit similar to non-linear function estimation, optimize.leastsq).
>>
>> There are and will be special subclasses, e.g. IV2SLS, that have
>> predefined moment conditions, but, still, it's up to the user to
>> construct design and instrument arrays.
>> And as far as I remember, the GMM/GEE package in R doesn't have a
>> formula interface either.
>>
>
> Both of the two gee packages in R I know of have formula interfaces.
>
> http://cran.r-project.org/web/packages/geepack/
> http://cran.r-project.org/web/packages/gee/index.html

I have to look at this. I mixed up some acronyms: I meant GEL and GMM,
http://cran.r-project.org/web/packages/gmm/index.html
The vignette was one of my readings, along with the Stata description of GMM.
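
To make the point about moment conditions concrete, here is a rough sketch of
the linear IV case. It is not the sandbox GMM interface; the function names
and the simulated data are assumptions for illustration only. The user still
builds the design array X and the instrument array Z by hand, and 2SLS is then
GMM with moments Z'(y - X b)/n and a weighting matrix proportional to
(Z'Z)^-1:

```python
import numpy as np


def moment_conditions(beta, y, X, Z):
    """Sample moments g(beta) = Z'(y - X beta) / n for a linear IV model."""
    resid = y - X @ beta
    return Z.T @ resid / len(y)


def iv2sls(y, X, Z):
    """Two-stage least squares estimate (illustrative, not the statsmodels class)."""
    # First stage: project the regressors onto the instruments.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the projected regressors.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


# Simulated data, made up for this example: one endogenous regressor x,
# two excluded instruments, and a constant in both X and Z.
rng = np.random.default_rng(0)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n)  # structural error, correlated with x below
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.8 * u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + u

beta_hat = iv2sls(y, X, Z)
g = moment_conditions(beta_hat, y, X, Z)
```

With more instruments than parameters, g is not exactly zero at beta_hat; only
the projected moments X_hat'(y - X beta_hat) are.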

I never really looked at GEE. (That's Skipper's private work so far.)

Josef

>
> -Chris JS
>
>> Josef
>>
>>>>
>>>
>>> Oh, it's not *that* bad. I agree, of course, that it could be better,
>>> but I've been using mainly Python for my work, including GMM and
>>> estimating equations models (mainly empirical likelihood and
>>> generalized maximum entropy) for the last ~two years.
>>>
>>> Skipper


