[BangPypers] (no subject)

Wed Sep 25 15:39:31 CEST 2013

On Tue, Sep 24, 2013 at 6:19 PM, Dhananjay Nene <dhananjay.nene at gmail.com> wrote:

> On Tue, Sep 24, 2013 at 6:11 PM, Dhananjay Nene
> <dhananjay.nene at gmail.com> wrote:
> > On Tue, Sep 24, 2013 at 6:04 PM, Dhananjay Nene
> > <dhananjay.nene at gmail.com> wrote:
> >> On Tue, Sep 24, 2013 at 5:48 PM, Vineet Naik <naikvin at gmail.com> wrote:
> >>> Hi,
> >>>
> >>> On Tue, Sep 24, 2013 at 10:38 AM, bab mis <babmis at outlook.com> wrote:
> >>>
> >>>> Hi, is there any XML parser which gives the same kind of data structure
> >>>> as the YAML parser gives in Python? I tried xml.dom.minidom, but it does
> >>>> not return a proper data structure; every time I need to read element by
> >>>> element and create the dict.
> >>>>
> >>>
> >>> You can try xmltodict[1]. It also retains the node attributes and makes
> >>> them accessible using the '@' prefix (see the example in the README of
> >>> the repo).
> >>>
> >>> [1]: https://github.com/martinblech/xmltodict
> >>
> >> Being curious I immediately took a look and tried the following :
> >>
> >> import xmltodict
> >>
> >> doc1 = xmltodict.parse("""
> >> <mydocument has="an attribute">
> >>   <and>
> >>     <many>elements</many>
> >>     <many>more elements</many>
> >>   </and>
> >>   <plus a="complex">
> >>     element as well
> >>   </plus>
> >> </mydocument>
> >> """)
> >>
> >> doc2 = xmltodict.parse("""
> >> <mydocument has="an attribute">
> >>   <and>
> >>     <many>more elements</many>
> >>   </and>
> >>   <plus a="complex">
> >>     element as well
> >>   </plus>
> >> </mydocument>
> >> """)
> >> print(doc1['mydocument']['and'])
> >> print(doc2['mydocument']['and'])
> >>
> >> The output was :
> >> OrderedDict([(u'many', [u'elements', u'more elements'])])
> >> OrderedDict([(u'many', u'more elements')])
> >>
> >> The only difference is that there is only one "many" node inside the
> >> "and" node in doc2. Do you see an issue here? (At least I do.) The output
> >> structure is a function of the cardinality of the inner nodes: it changes
> >> shape from a list of many elements to, not a list of one, but just the
> >> one element (throwing away the list). That can make things rather
> >> unpredictable, since you cannot tell upfront whether the existence
> >> of just one node inside a parent node is required by the XML
> >> schema or just happens to hold in that particular instance.
> >>
> >> I do think the problem is tractable so long as one clearly documents
> >> the specific constraints which the underlying XML will satisfy,
> >> constraints which will make transformations to lists or dicts safe.
> >> Trying to make it easy without clearly documenting the constraints
> >> could lead to violations of the principle of least surprise like
> >> above.
> >>
>

The README does mention that it's based on this "spec"[1] (or
rather a blog post), which lays out the assumptions. But the project
seems to be missing a lot of documentation in general as well.
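
Until such constraints are documented, calling code can defend against the
list-versus-scalar shape change by normalizing each value before iterating
over it. A minimal sketch: `as_list` is a hypothetical helper of my own (not
part of xmltodict), and the plain dicts below merely stand in for the parsed
"and" nodes of doc1 and doc2 quoted above.

```python
def as_list(value):
    """Return value unchanged if it is already a list, else wrap it in one."""
    if value is None:
        return []          # a missing key becomes an empty list
    if isinstance(value, list):
        return value       # repeated nodes already arrive as a list
    return [value]         # a single node arrives bare, so wrap it


# Stand-ins for doc1['mydocument']['and'] and doc2['mydocument']['and'].
and1 = {'many': ['elements', 'more elements']}
and2 = {'many': 'more elements'}

for node in (and1, and2):
    # The same loop now works whether there was one "many" node or several.
    for text in as_list(node.get('many')):
        print(text)
```

This only helps when the caller already knows which keys may repeat, which
is exactly the schema knowledge the parsed structure does not carry.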

Out of curiosity I looked into the code to see if the author has left
any comments about this inconsistency (value type varying between lists
and unicode/OrderedDict). While there are no such comments, I found
that the `parse` function can take a keyword arg `dict_constructor`, so
any other dict-like structure can be used instead of OrderedDict.

For example, to force every node to be inside a list irrespective of its
cardinality:

import xmltodict
from collections import defaultdict

doc2 = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""", dict_constructor=lambda *a, **k: defaultdict(list))

>>> doc2
defaultdict(<type 'list'>, {u'mydocument': [defaultdict(<type 'list'>,
{u'and': [defaultdict(<type 'list'>, {u'many': [u'more elements']})],
u'plus': [defaultdict(<type 'list'>, {'#text': [u'element as well'], u'@a':
u'complex'})], u'@has': u'an attribute'})]})

>>> doc2['mydocument'][0]['and'][0]['many']
[u'more elements']

Of course, defaultdict would lead to the order of nodes being lost, but an
"OrderedDefaultDict" (never tried before :-)) might work.
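
For what it's worth, such an "OrderedDefaultDict" could be sketched roughly
as below. The class name and its constructor signature are my own
assumptions, not something xmltodict or the standard library provides, and I
have not tried passing it as `dict_constructor`. Copying and pickling are
deliberately left unhandled.

```python
from collections import OrderedDict


class OrderedDefaultDict(OrderedDict):
    """An OrderedDict that creates missing values with default_factory."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super(OrderedDefaultDict, self).__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self[key] = self.default_factory()
        return value


d = OrderedDefaultDict(list)
d['and'].append('first')
d['plus'].append('second')
d['and'].append('third')
print(list(d.items()))  # keys come back in first-insertion order
```

Note that this keeps only the order in which keys were first seen. Repeated
siblings are still grouped under a single key, so it would not preserve an
interleaved sequence of nodes.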

> > It gets even more interesting, e.g. below:
> >
> > doc3 = xmltodict.parse("""
> > <mydocument has="an attribute">
> >   <and>
> >     <many>elements</many>
> >   </and>
> >   <plus a="complex">
> >     element as well
> >   </plus>
> >   <and>
> >     <many>more elements</many>
> >   </and>
> > </mydocument>
> > """)
> >
> > print(doc3['mydocument']['and'])
> >
> > leads to the output :
> >
> > [OrderedDict([(u'many', u'elements')]), OrderedDict([(u'many', u'more
> > elements')])]
> >
> > Definitely not what would be naively expected.
>
> Correction:
>
> print(doc3['mydocument'])
>
> prints
>
> OrderedDict([(u'@has', u'an attribute'), (u'and',
> [OrderedDict([(u'many', u'elements')]), OrderedDict([(u'many', u'more
> elements')])]), (u'plus', OrderedDict([(u'@a', u'complex'), ('#text',
> u'element as well')]))])
>
> which just trashed the ordering of an "and" followed by a "plus" followed
> by an "and".
>

This is a more serious problem, particularly if the dict has to be
serialized back to XML.
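
When document order matters, one alternative is to skip the dict conversion
and keep children in a list. The sketch below uses the standard library's
xml.etree.ElementTree rather than xmltodict; `to_ordered` and its tuple
layout are hypothetical choices of mine, shown only to illustrate that the
"and", "plus", "and" sequence of doc3 survives when siblings are not grouped
by tag name.

```python
import xml.etree.ElementTree as ET

XML = """
<mydocument has="an attribute">
  <and>
    <many>elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
  <and>
    <many>more elements</many>
  </and>
</mydocument>
"""


def to_ordered(elem):
    """Convert an element to (tag, attributes, text, children).

    Children are kept as a list in document order instead of being
    grouped under their tag name.
    """
    text = (elem.text or '').strip()
    children = [to_ordered(child) for child in elem]
    return (elem.tag, dict(elem.attrib), text, children)


root = ET.fromstring(XML)
tag, attrs, text, children = to_ordered(root)
print([child[0] for child in children])  # ['and', 'plus', 'and']
```

The price is that lookups by name become a scan over the children list
rather than a key access.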

Thanks for pointing out these issues, I had missed them entirely :-)

[1]: http://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

- Vineet

> Dhananjay
>

-- 
Vineet Naik