xml processing : too slow...

Thu Jul 25 07:43:51 EDT 2002

Shagshag13 wrote:
        ...
> sorry to bother, but i get "ExpatError: junk after document element: line
> 1, column 188" and don't understand what it mean...
> 
>>>> t = """<tag0><tag1> 1 2 </tag1><tag2 attr="value">3</tag2></tag0>"""
>>>> parser.Parse(t, 0)
> Traceback (most recent call last):
>   File "<pyshell#33>", line 1, in ?
>     parser.Parse(t, 0)
> ExpatError: junk after document element: line 1, column 188

Hmmm... the first time you call parser.Parse(t,0), everything is
fine.  The SECOND time, though, the parser sees a second toplevel
tag, and that's an XML no-no.  I know of no way to "reset" an
expat parser instance to tell it to start accepting a new document
from scratch.  A call with a second argument of 1 does not appear
to perform a reset, in particular.
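
A minimal sketch that reproduces the failure described above, written
for present-day Python 3 rather than the interpreter used in this
thread.  The variable name second_error is only illustrative; the
expat calls (ParserCreate, Parse, ExpatError, ErrorString) are the
standard xml.parsers.expat ones.

```python
from xml.parsers import expat

t = '<tag0><tag1> 1 2 </tag1><tag2 attr="value">3</tag2></tag0>'

parser = expat.ParserCreate()
parser.Parse(t, False)          # first document: accepted

try:
    # Same parser instance: this is a second top-level element.
    parser.Parse(t, False)
    second_error = None
except expat.ExpatError as err:
    # Translate the numeric error code into expat's message text.
    second_error = expat.ErrorString(err.code)

print(second_error)
```

The first call succeeds and the second raises, which is the behaviour
reported in the traceback quoted above.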

I guess you could trick the parser by starting with a first parse
of, e.g. '<faketoplevel>' and ending with a last parse of
'</faketoplevel>'.  However, with such an approach, many kinds
of errors would only be diagnosed at the very end, not for each
given noncompliant line.  Indeed, parser.Parse with a second
argument of 0 may be able to diagnose SOME errors, such as a
tag being closed that was never opened, but not others -- assuming,
that is, that you DO need each single separate line to be a
well-formed XML document (and with what other detailed
constraints, we still don't know -- your examples show only
single digits and spaces as contents, but we have no idea if
that's part of the specs or just happenstance from your examples).
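
A rough sketch of the fake-toplevel trick and of its weakness, in
present-day Python 3.  The helper name parse_wrapped and the sample
lines are made up for illustration.  It feeds all lines to ONE parser
and reports which Parse call raised, which need not be the call that
received the offending line.

```python
from xml.parsers import expat

def parse_wrapped(lines):
    """Feed every line to one parser inside a fake top-level element.

    Returns None if nothing raised, else (chunk_index, message) for
    the Parse call that raised.
    """
    parser = expat.ParserCreate()
    chunks = ["<faketoplevel>"] + list(lines) + ["</faketoplevel>"]
    for index, chunk in enumerate(chunks):
        is_final = index == len(chunks) - 1
        try:
            parser.Parse(chunk, is_final)
        except expat.ExpatError as err:
            return index, expat.ErrorString(err.code)
    return None

good = '<tag0><tag1> 1 2 </tag1></tag0>'
bad = '<tag0><tag1> 1 2 </tag1>'      # tag0 is never closed

print(parse_wrapped([good, good]))    # None
print(parse_wrapped([bad, good]))     # raised at chunk 3, not chunk 1
```

In the second call the unclosed tag0 sits in chunk 1, but the parser
only complains when it meets the closing wrapper tag in chunk 3, so
the faulty line cannot be identified from the error alone.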

I suspect you'll need to get a new parser for each line.
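
One way the parser-per-line idea might look in present-day Python 3.
This is only a sketch: the function name is_well_formed and the
sample lines are assumptions, and it checks well-formedness only, not
any further constraint on the contents.

```python
from xml.parsers import expat

def is_well_formed(line):
    """Return True if `line` is, by itself, a well-formed XML document."""
    parser = expat.ParserCreate()     # fresh parser: no state carried over
    try:
        parser.Parse(line, True)      # True: this is the complete document
    except expat.ExpatError:
        return False
    return True

lines = [
    '<tag0><tag1> 1 2 </tag1><tag2 attr="value">3</tag2></tag0>',
    '<tag0><tag1> 1 2 </tag0>',       # tag1 closed by the wrong tag
    '<tag0/><tag0/>',                 # two top-level elements
]

# Report 1-based numbers of the lines that fail the check.
bad_lines = [n for n, line in enumerate(lines, 1) if not is_well_formed(line)]
print(bad_lines)                      # [2, 3]
```

Because each line gets its own parser, every error is pinned to the
line that caused it.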

Take heart, though.  On my oldish PC, this script:

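import time

from xml.parsers import expat
t = """<tag0><tag1> 1 2 </tag1><tag2 attr="value">3</tag2></tag0>"""

# time.process_time() reports CPU seconds, as time.clock() did on Unix
start = time.process_time()
for i in range(100000):
    p = expat.ParserCreate()    # a fresh parser for every document
    p.Parse(t, True)            # True: this chunk is the whole document
stend = time.process_time()

print(stend - start)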

prints 2.16, plus or minus 1/100 of a second.  I.e., the pure
work of looping 100,000 times, creating as many parser objects
and parsing that same line repeatedly, takes a bit over 2 seconds
on a cheap PC.  I think it scales -- doing it 10,000 times takes
0.22 repeatably.  So, for your 2,000,000 lines, instantiation
and use of expat instances should be roughly 45 seconds, if your
machine is comparable to mine.  In fact, wait a minute... yeah,
two million iterations DO print between 43.11 and 43.4 (CPU
seconds); elapsed time according to the time command is 45.12
seconds, using 96% of my CPU.
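
The extrapolation is plain linear scaling from the 100,000-line
measurement.  Spelled out in present-day Python 3 (the variable names
are illustrative, the figures are the ones quoted above):

```python
measured_seconds = 2.16        # CPU time for the 100,000-iteration run
measured_lines = 100_000
target_lines = 2_000_000

per_line = measured_seconds / measured_lines
estimate = per_line * target_lines

print(f"{per_line * 1e6:.1f} microseconds per line")    # 21.6
print(f"{estimate:.1f} seconds for two million lines")  # 43.2
```

That lands inside the 43.11 to 43.4 range actually measured for two
million iterations, so the cost per line looks constant.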

You'll need to do more, of course (I/O, splitting, etc), but for
the task of verifying that each line is indeed well formed XML on
its own, this approach seems to be quite viable, and should add
less than a minute to your overall program's runtime.  Not bad.

Alex