Regex help needed!

Peter Otten __peter__ at web.de
Mon Dec 21 07:58:55 EST 2009


Oltmans wrote:

> I've a string that looks something like
> ----
> lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
> =   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
> ----
> 
> From the above string I need the digits within the ID attribute. For
> example, the required output from the above string is
> - 35343433
> - 345343
> - 8898
> 
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
> 
> but I was wondering whether there might be a better regex to do the
> same thing. Could you kindly suggest a better/improved regex? Thank you
> in advance.

>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
... =   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']
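
The session above uses the BeautifulSoup 3 import style on Python 2. If
BeautifulSoup is not available, the same idea (parse the markup, then filter
on the id attribute) can be sketched with Python 3's standard-library
html.parser module. This is only an illustrative sketch, not part of the
original session; the names PREFIX, AmazonIdCollector and amazon_ids are made
up for this example.

```python
from html.parser import HTMLParser

PREFIX = "amazon_"  # hypothetical constant name, chosen for this sketch


class AmazonIdCollector(HTMLParser):
    """Collect whatever follows 'amazon_' in each id attribute."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when an
        # attribute is written without a value.
        for name, value in attrs:
            if name == "id" and value and value.startswith(PREFIX):
                self.ids.append(value[len(PREFIX):])


def amazon_ids(html):
    """Return the id suffixes in document order."""
    parser = AmazonIdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


sample = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

print(amazon_ids(sample))
```

Because the parser tokenizes the attributes itself, the whitespace around "="
and the mix of single and double quotes in the sample need no special
handling. Unlike the original regex, this sketch does not check that the
suffix consists only of digits; add value[len(PREFIX):].isdigit() to the
condition if that matters.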

I think BeautifulSoup is a better tool for the task since it actually 
"understands" HTML.

Peter


