Regex help needed!
Peter Otten
__peter__ at web.de
Mon Dec 21 07:58:55 EST 2009
Oltmans wrote:
> I've a string that looks something like
> ----
> lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
> = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
> ----
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.
>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
... = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']
I think BeautifulSoup is a better tool for the task since it actually
"understands" HTML.
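
If installing BeautifulSoup is not an option, the same idea (let a parser find the attributes, then slice off the prefix) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup call shown in the session, and it assumes Python 3. The names AmazonIdCollector and PREFIX are made up for illustration.

```python
from html.parser import HTMLParser

PREFIX = "amazon_"  # hypothetical constant: the id prefix we are looking for


class AmazonIdCollector(HTMLParser):
    """Collect the text that follows PREFIX in every matching id attribute."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute has no value, so guard before calling startswith().
        for name, value in attrs:
            if name == "id" and value and value.startswith(PREFIX):
                self.ids.append(value[len(PREFIX):])


html = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

collector = AmazonIdCollector()
collector.feed(html)
collector.close()
print(collector.ids)
```

As with the BeautifulSoup version, the whitespace and line break around the "=" and the mix of quote styles are handled by the parser, so none of it has to be spelled out in a pattern. The IDs are collected in document order.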
Peter