[IPython-dev] [sympy] Re: using reST for representing the notebook cells+text

Thu Feb 25 13:32:54 EST 2010

I would expect that parsing an XML notebook would be faster than
importing/exec'ing a pure Python-based notebook (as Robert's tests
show).  It is not clear to me that performance is the central concern,
but it could be in some situations.  We all know that long import times
are an issue for some projects.
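
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. It is not Robert's attached benchmark: the `Cell` and `Notebook` classes, the XML tag names and the `make_sources` helper are all assumptions made up for illustration, and the timings will depend on the machine and the Python version.

```python
import timeit
import xml.etree.ElementTree as ET


# Hypothetical stand-ins for the notebook API classes (not IPython's real ones).
class Cell:
    def __init__(self, input, output):
        self.input = input
        self.output = output


class Notebook:
    def __init__(self, cells):
        self.cells = cells


def make_sources(n_cells):
    """Build equivalent XML and Python-source texts describing n_cells cells."""
    xml_cells = "".join(
        "<cell><input>x = %d</input><output>%d</output></cell>" % (i, i)
        for i in range(n_cells)
    )
    py_cells = ",\n".join(
        "    Cell(input=%r, output=%r)" % ("x = %d" % i, str(i))
        for i in range(n_cells)
    )
    xml_text = "<notebook>%s</notebook>" % xml_cells
    py_text = "nb = Notebook([\n%s\n])\n" % py_cells
    return xml_text, py_text


def load_xml(xml_text):
    """Parse the XML text, then build the objects from the element tree."""
    root = ET.fromstring(xml_text)
    cells = [Cell(c.findtext("input"), c.findtext("output"))
             for c in root.iter("cell")]
    return Notebook(cells)


def load_python(py_text):
    """Compile and execute the Python source, then pick out the `nb` name."""
    namespace = {"Notebook": Notebook, "Cell": Cell}
    exec(compile(py_text, "<notebook>", "exec"), namespace)
    return namespace["nb"]


if __name__ == "__main__":
    xml_text, py_text = make_sources(1000)
    print("xml   :", timeit.timeit(lambda: load_xml(xml_text), number=20))
    print("python:", timeit.timeit(lambda: load_python(py_text), number=20))
```

Both loaders end with the same `Notebook` of `Cell` objects, so the two timings cover parsing plus object construction for each format.
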

Robert, do you have any thoughts about the validation/extensibility
issues related to XML that I brought up?

I think at this point, those are my main concerns with XML versus pure python.

Cheers,

Brian

On Wed, Feb 24, 2010 at 3:47 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Wed, Feb 24, 2010 at 17:04, Mikhail Terekhov <termim at gmail.com> wrote:
>> On Wed, Feb 24, 2010 at 4:04 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>>
>>> I am almost certain that their use cases and workloads are much
>>> different than the notebook's would be. Python's parser isn't exactly
>>> a speed demon, either. A general statement like "XML is slow" followed
>>> by an unrelated anecdote is not terribly convincing. Show me
>>> experiments. I've attached mine. Python ends up being about 3 times
>>> slower than the equivalent XML for a variety of file sizes.
>>
>> Believe it or not, I can't find any example of a Python project that started
>> implementing a scientific notebook as an XML document and then switched to
>> something else :) I've used a "scientific analogy" principle. Seriously,
>> Subversion is a real project and they really suffered from the decision to
>> use XML as storage for the workspace metadata and they really switched away
>> from XML. No anecdotes.
>
> Yes, that is an anecdote. Anecdotes are true stories, but they are not
> convincing data. This comparison is by no means scientific. There are
> large differences between the use cases here. The simple text format
> that Subversion moved to is not comparable to the Python format being
> discussed here. Subversion faced different read/write loads than a
> notebook will. I'm happy to consider other examples of projects
> finding that their XML parsers were too slow, but you need to make a
> more considered argument that the circumstances are similar enough to
> the one we are talking about.
>
> XML is not slow. XML is not fast. It cannot be either thing because it
> is a file format, not an implementation. Parsers can be slow or fast.
> cElementTree is particularly fast and rather faster than the Python
> parser on equivalent data.
>
>> IMHO the relation is quite simple - things like Mathematica's notebooks
>> tend to multiply and form libraries or collections. In this case XML
>> parsing could become a problem.
>
> I'm sorry, but this does not follow.
>
>> Your example is not quite correct but it is a good start :} It actually
>> illustrates two important points. First, writing a serializer that
>> produces an XML representation is easy. Second, and more important, after
>> parsing the XML you have nothing but an internal XML representation, and
>> the only thing you can do with it is write it back to a file. You still
>> need to implement all the functionality as Python objects and convert the
>> XML tree into a tree of Python objects. Only after that will there be any
>> point in benchmarking. On the contrary, the Python version is complete,
>> and if you had the real Notebook implementation you would be ready to use
>> it. Also note that there is no need to write, debug and support _any_
>> parser/reader; Python provides that for free.
>
> I intended to measure the performance of the parsing, not the
> difficulty of implementing anything else. The cost of constructing
> the objects is rather smaller, at least for XML. It is a negligible
> cost on top of the XML parsing, but executing the Python code
> actually imposes a much larger burden on the Python implementation.
> See the attached updated benchmark. The Python method now takes about
> 4x as much time to construct the Notebook instance as the XML one.
>
>>> I'm not talking about other projects adopting anything. I'm talking
>>> about basic capabilities of other languages, like JavaScript's builtin
>>> support for parsing XML. That enables *us* to build things in
>>> JavaScript.
>>>
>>>> BTW the fact that
>>>> everyone can parse XML doesn't mean that every one can _use_ the
>>>> data right away.
>>>
>>> Nor am I saying that. I am saying that it is enormously easier to
>>> build the JavaScript parser for the XML representation rather than the
>>> Python one.
>>
>> That is the real question - why does JavaScript need to read the _internal_
>> representation of the nb if it is not going to implement all the
>> functionality needed to use it?
>
> File formats are not internal. They are external. They are the primary
> method of interchange.
>
>>>> One has to have an internal logic/library/API specific
>>>> to the data represented by some particular XML document. If you take
>>>> this into account then the value of the exchange document format
>>>> somewhat reduces. It is still not zero though and IMHO it is easy to
>>>> teach classes proposed by Brian to produce XML representation just
>>>> for the mythical interchange with something :)
>>>
>>> The need for interchange is not at all mythical. Web front ends are
>>> exactly what we are talking about in this thread.
>>
>> Sure, and it looks like in his very interesting approach the JavaScript
>> part is a client that queries the Python server for the information about
>> the nb it needs, and there is no need for JS to read the nb or even know
>> how it is stored on disk.
>>
>> More general: internal representation does not have to be tightly coupled to
>> interfaces to external systems.
>
> Exactly. The Python code file format is a greater coupling to the
> internal implementation, not a looser one.
>
>> Simplicity and reliability of the internal representation (in this case -
>> just a regular Python compiler versus a custom XML parser) outweighs the
>> need to write relatively simple export/interface functions that give a
>> view on the nb.
>
> I disagree with that judgement and with the characterization of the
> Python format as more simple and reliable.
>
>> As Ondrej's work shows they are needed anyway and of course they can use XML if
>> it is easy for the client.
>>
>>>>> JavaScript being the hugely important player here. Certainly, you are
>>>>
>>>> Again, it is important to define to what degree the interoperability with
>>>> something like JavaScript is needed. If you plan to work on/modify/execute
>>>> the same nbs in Python and in JavaScript then you have to implement
>>>> compatible engine/API in Python _and_ in JavaScript. Are you sure you
>>>> want to do that? If only the representation or "computed" notebook is
>>>> needed for display purposes by JavaScript, then it is something different
>>>> and could be implemented through specialized repr methods.
>>>
>>> Or you could use the same mechanism for both instead of duplicating efforts.
>>
>> Unfortunately one has to duplicate something in either case. nb->XML would
>> duplicate nb->repr, but as your example shows nb->XML is quite
>> straightforward. In the XML->nb case one has to duplicate the Python
>> compiler, which is unnecessary in the repr->nb case.
>
> Honestly, it's pretty trivial stuff.
>
>>>>> going to have a Python API that will represent that tree of text nodes
>>>>> as Python objects, but I just don't see the point of making the repr()
>>>>> of that be the lingua franca format of the notebook file. It's just a
>>>>> wasted opportunity.
>>>>
>>>> The point is that the nb becomes a first class Python object - just a module,
>>>> no need for a specialized parser and you can work with it as with a regular
>>>> Python module - just import and use it. The only difference is that the nb is
>>>> mutable - if you modify it then you have to save it.
>>>
>>> I really don't see why having the file format be Python code makes it
>>> any more of a first class object. The objects are the first class
>>
>> You are right - not the first class, just a native python object.
>
> A file with Python code is not a native Python object. From the API
> user's perspective, they call a function or execute a statement and
> then they have an object. It's exactly the same for any format. There
> is no benefit for either case.
>
>>> objects. As long as loading to those objects is easy, the format just
>>
>> In a sense I agree, the only difference is that from the programming POV
>> loading cost for repr->nb is zero (all is done by the regular python compiler)
>> and XML->nb requires a special loader that should be maintained and updated
>> when the application changes.
>
> Actually, that raises another objection I have to using Python code as
> the file format. The file format is intimately tied to the internal
> implementation. What if we want to change the internal implementation?
> It *will* happen. With a more neutral representation, you can read
> older files as long as you update the reader appropriately. If you are
> just executing code, you are stuck with maintaining the classes with
> backwards compatible argument specs forever and won't be able to make
> some of the changes that you want down the road. Format versioning is
> a *huge* issue whenever you design file formats. XML and other generic
> representations permit this.
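
The versioning point above can be sketched as follows. Everything here is an assumption for illustration: the `version` attribute, the `<entry>` to `<cell>` rename and the function names are hypothetical, not part of any notebook format proposed in this thread.

```python
import xml.etree.ElementTree as ET

CURRENT_VERSION = 2


def upgrade_v1_to_v2(root):
    """Hypothetical change: version 1 named cells <entry>, version 2 uses <cell>."""
    for elem in list(root.iter("entry")):
        elem.tag = "cell"
    root.set("version", "2")
    return root


# One upgrade step per old version; a new format version only adds an entry here.
UPGRADES = {1: upgrade_v1_to_v2}


def read_notebook(xml_text):
    """Parse to a neutral tree, upgrade it, and only then build the objects."""
    root = ET.fromstring(xml_text)
    version = int(root.get("version", "1"))
    while version < CURRENT_VERSION:
        root = UPGRADES[version](root)
        version = int(root.get("version"))
    if version != CURRENT_VERSION:
        raise ValueError("unsupported notebook version: %d" % version)
    # Plain tuples stand in for the real API classes.
    return [(c.findtext("input"), c.findtext("output"))
            for c in root.iter("cell")]
```

The reader works on the tree before any API class is instantiated, so the classes themselves never have to accept old argument specs.
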
>
> The reason that Mathematica can get away with this is that it is a
> Lispish language. Although you correctly point out that being able to
> do things like ExpressionCell isn't particularly important, being
> able to load the code into a neutral tree structure and manipulate it
> before actually instantiating your API classes is.
>
>>> doesn't matter. Loading an object by importing is actually a very
>>> inflexible and difficult to work with method compared to a function
>>> call.
>>
>> If one prefers functions one can always use the __import__() or
>> imp.load_module() functions instead of the import statement.
>
> Abusing a Python-internal API as your main file loading API is just
> not a good practice.
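
As a rough illustration of loading through an ordinary function call rather than the import machinery, here is a sketch. The `load_notebook` name and the XML layout are assumptions, not an existing IPython API.

```python
import io
import xml.etree.ElementTree as ET


def load_notebook(source):
    """Load (input, output) pairs from a filename or an open binary file object."""
    tree = ET.parse(source)  # ET.parse accepts either a path or a file object
    return [(c.findtext("input"), c.findtext("output"))
            for c in tree.getroot().iter("cell")]


# The same call works on an in-memory buffer, with no sys.path or module name involved.
data = b"<notebook><cell><input>1+1</input><output>2</output></cell></notebook>"
nb = load_notebook(io.BytesIO(data))
```

A plain function like this can take its data from a file, a socket or a test fixture, which is harder to arrange when loading means importing a module.
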
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco
>

-- 
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com