[Tutor] Trying to parse a HUGE(1gb) xml file in python

Tue Dec 21 10:17:31 CET 2010

On Tue, Dec 21, 2010 at 4:10 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> David Hutto, 21.12.2010 09:55:
>>
>> On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
>>>
>>> Chris Fuller, 21.12.2010 03:27:
>>>>
>>>> This isn't XML, it's an abomination of XML.  Best to not treat it as XML.
>>>> Good thing you're only after one class of tags.  Here's what I'd do.  I'll
>>>> give a general solution, but there are two parameters / four cases that
>>>> could make the code simpler, I'll just point them out at the end.
>>>>
>>>> Iterate over the file descriptor, reading in line-by-line.  This will be
>>>> slow on a huge file, but probably not so bad if you're only doing it once.
>>>
>>> Note that it's not unlikely that this is actually *slower* than using a
>>> real XML parser:
>>
>> Or maybe a 'real' language like C or C++, to speed up or, in Python's
>> case, bypass the interpreter?
>
> While this may be a little faster than Python code (although I suspect that
> benchmarking is needed to prove either way), I doubt that it's worth the
> overhead in code writing. If I can write a couple of lines of Python code
> that are easy to validate and almost as fast as C code, why would I want to
> write and debug hundreds of lines of code in C or C++, just to see that I
> need to tune my benchmark to notice the difference?

Don't get me wrong, I love the simplicity too, but if you know you
really will need the extra performance along the way, then you should
start thinking past the easy code and toward the harder code for your
project. Just as every language has its place, so does Python.
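
For what it's worth, the two Python approaches being compared above
could be sketched roughly like this. It is only an illustration under
assumptions of mine: the tag name "record" is made up, and the
iterparse version only works if the file is well-formed XML, which
Chris says the original file is not. Which one is faster would still
need benchmarking on the real data.

```python
import io
import re
import xml.etree.ElementTree as ET


def count_by_line_scan(stream, tag):
    """Count opening <tag ...> occurrences by scanning text line by line.

    No XML parsing is done, so this also works on files that are not
    well-formed, but it knows nothing about comments, CDATA, etc.
    """
    # Match "<tag" followed by whitespace, "/" or ">", so that a tag
    # named "records" is not counted when looking for "record".
    pattern = re.compile(r"<%s[\s/>]" % re.escape(tag))
    count = 0
    for line in stream:  # file objects yield one line at a time
        count += len(pattern.findall(line))
    return count


def count_by_iterparse(source, tag):
    """Count <tag> elements with a streaming parser (well-formed XML only)."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the element's text and children once it has been seen,
        # so a 1 GB file is not held in memory as a full tree.
        elem.clear()
    return count


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file; "record" is hypothetical.
    sample = (
        '<root>\n'
        '<record id="1">a</record>\n'
        '<other/>\n'
        '<record id="2">b</record>\n'
        '</root>\n'
    )
    print(count_by_line_scan(io.StringIO(sample), "record"))
    print(count_by_iterparse(io.BytesIO(sample.encode("utf-8")), "record"))
```

Both functions also accept an open file, so the same calls can be
wrapped in time.time() or timeit on the real file to see whether the
difference is even big enough to justify dropping down to C.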

>
> But then, people even write XML handling code in Java, where neither
> performance nor code size is a suitable argument.
>
> Stefan

-- 
They're installing the breathalyzer on my email account next week.