[Tutor] Another regular expression question

Kent Johnson kent37 at tds.net
Wed Sep 14 16:55:43 CEST 2005


Bernard Lebel wrote:
> Thanks for that pointer Kent, I'll check it out. Also thanks for
> letting me know I'm not nuts! :-)
> 
> Alan's suggestion about BeautifulSoup is actually excellent. The
> documentation is nice and the tool is very easy to use.
> 
> However, is it normal that parsing a 2618-line XML file takes
> 20-30 seconds or so?

That seems slow to me unless the lines are really long! How many bytes is the file? But I don't have much experience with BeautifulSoup.

ElementTree is fast and cElementTree (the C implementation) is really fast. I have used it to read, process and write a 28 MB XML file; the whole run took about 10 seconds.
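
For a rough idea of what that looks like, here is a minimal sketch, not a definitive recipe. It assumes a Python where ElementTree is importable as xml.etree.ElementTree (as a separate download the import path is different), and the sample document, the tag name "object" and the helper find_object_names are all made up for illustration since I haven't seen your file:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag and attribute names are invented
# for this sketch and do not come from the original poster's file.
SAMPLE = """<scene>
  <object name="cube"><param key="size">2</param></object>
  <object name="sphere"><param key="radius">1</param></object>
</scene>"""


def find_object_names(source):
    """Return the 'name' attribute of every <object> element.

    `source` may be a filename or an open file-like object.
    """
    tree = ET.parse(source)   # parse the whole document into a tree
    root = tree.getroot()     # the top-level element (<scene> here)
    # iter() walks the tree recursively and yields matching elements
    return [elem.get("name") for elem in root.iter("object")]


names = find_object_names(io.StringIO(SAMPLE))
print(names)
```

With a real file you would pass the filename instead of the StringIO object. Pulling out a specific tag this way takes one loop and does not care how the tags are nested or how the attributes are ordered.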

Kent

> 
> 
> Thanks
> Bernard
> 
> 
> 
> On 9/14/05, Kent Johnson <kent37 at tds.net> wrote:
> 
>>Bernard Lebel wrote:
>>
>>>Thanks Alan,
>>>
>>>I'll check BeautifulSoup asap.
>>>
>>>I'm using regex simply because I have no clue where to start to parse
>>>XML. I have read about the various XML tools available in the Python
>>>library, however I'm at a complete loss as to what to make of them. Many
>>>of them seem to use some programming standards, which I am completely
>>>unfamiliar with (this is the first time that I dig into XML writing
>>>and parsing).
>>>
>>>I don't know where to start to learn about all these standards, and as
>>>usual with new programming things, the documentation is hard to
>>>swallow (it usually is written more as a reference than a proper user
>>>guide/tutorial). I have to admit this is very frustrating, so if I'm
>>>looking at things from a wrong perspective please advise me, I need
>>>it.
>>
>>I agree that the Python XML story is confusing even for the files in the standard library. Worse, the (IMO) best solutions are not to be found in the standard lib or PyXML at all.
>>
>>The std lib and PyXML are based on the DOM and SAX standards. These standards were designed to be "language-neutral" - there are implementations in Python, Java and other languages. The good side of this is, if you learn how to use them, the knowledge is pretty portable to other languages. The bad side is, the APIs defined by the standard are IMO clunky and painful to use, especially in Python.
>>
>>There is a current thread on comp.lang.python discussing this with good suggestions and pointers to more info:
>>http://groups.google.com/group/comp.lang.python/browse_frm/thread/a48891aa645ead13/dcd8fdc20b4b191b?hl=en#dcd8fdc20b4b191b
>>
>>My personal preference is ElementTree. Beautiful Soup is good too though I have only tried it with HTML. If I was running on Linux I would try lxml which uses the ElementTree API and adds full XPath support. Amara looks like the Cadillac solution - big and cushy. I haven't tried it. Uche's articles (referenced in the thread above) have pointers to many other choices but these seem to be the most popular.
>>
>>My favorite XML lib is actually dom4j which is in Java. It works great with Jython.
>>
>>Kent
>>
>>
>>>So right now I'm just taking a shortcut and using an ultra-simple
>>>re-based parser to retrieve the tags I'm looking for. I know it will
>>>probably be slow, but hopefully I'll get familiar with sophisticated
>>>parsing in the future and improve my code. As it stands right now,
>>>even the re syntax is not super easy to learn.
>>
>>For what you are doing re seems fine to me. You can get in trouble using re's with XML because of nested tags, variations in spelling and order, and probably a bunch of other things. But for simple stuff it can work fine.
>>
>>Kent
>>
>>
>>>
>>>Kent: That works (of course!). Thanks a bunch once again!
>>>
>>>
>>>Thanks
>>>Bernard
>>>
>>>On 9/14/05, Alan G <alan.gauld at freenet.co.uk> wrote:
>>>
>>>
>>>>Hi Bernard,
>>>>
>>>>
>>>>
>>>>>Hello, yet another regular expression question :-)
>>>>>
>>>>>So I have this xml file that I'm trying to find a
>>>>>specific tag in.
>>>>
>>>>I'm always suspicious when I see regular expression
>>>>and xml/html in the same context. Regexes are not good
>>>>for parsing xml/html files and it's usually much easier
>>>>to use a proper parser - such as beautiful soup.
>>>>
>>>>http://www.crummy.com/software/BeautifulSoup/
>>>>
>>>>Is there any special reason why you are using a regex
>>>>sledgehammer to crack this particular nut? Or is it
>>>>just to gain experience using regex?
>>>>
>>>>Alan G.
>>>>
>>>
>>>_______________________________________________
>>>Tutor maillist  -  Tutor at python.org
>>>http://mail.python.org/mailman/listinfo/tutor


