Regular expression issue
Sibylle Koczian
Sibylle.Koczian at Bibliothek.Uni-Augsburg.de
Mon Jul 24 11:29:17 EDT 2006
dmbkiwi at gmail.com wrote:
> I'm trying to parse a line of html as follows:
>
> <td style="width:20%" align="left">101.120:( KPA (-)</td>
> <td style="width:35%" align="left">Snow on Ground)0 </td>
>
> however, sometimes it looks like this:
>
> <td style="width:20%" align="left">N/A</td>
> <td style="width:35%" align="left">Snow on Ground)0 </td>
>
>
> I want to get either the numerical value 101.120 (which could be a
> different number depending on the data that's been fed into the page)
> or, in the second case, 'N/A'.
>
> The regexp I'm using is:
>
> .*?Pressure.*?"left">(?P<baro>\d+?|N/A)</td>|\sKPA.*?Snow\son\sGround
>
Wouldn't it be simpler to use HTMLParser or something similar first to
separate text from HTML tags and get the content of each cell separately?
Then you only have to find the 'right' cell, possibly quite simply by
its position in the HTML table, and check whether it contains 'N/A' or
something numeric (that check wouldn't need a regular expression if it's
really so simple).
I have no Python here, so I can't try it out and be more specific, but
look for HTMLParser in the library reference.
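
A minimal sketch of that approach, untested against the real page. It
assumes Python 3, where the module is html.parser (in Python 2 it was
called HTMLParser). The names CellCollector, pressure_from_cell and
extract_pressure are made up for illustration, and finding the cell by
looking just before the 'Snow on Ground' cell is only a guess at the
table layout:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._buf = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so buffer it per cell.
        if self._in_cell:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append("".join(self._buf).strip())
            self._in_cell = False


def pressure_from_cell(text):
    """Return 'N/A', the leading number as a string, or None."""
    text = text.strip()
    if text == "N/A":
        return "N/A"
    # Assumption: the number is whatever precedes the first colon.
    head = text.split(":", 1)[0].strip()
    try:
        float(head)
    except ValueError:
        return None
    return head


def extract_pressure(html, label="Snow on Ground"):
    """Return the pressure from the cell just before the labelled cell."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    for i, cell in enumerate(parser.cells):
        if i > 0 and cell.startswith(label):
            return pressure_from_cell(parser.cells[i - 1])
    return None
```

Called on the two rows quoted above, extract_pressure(...) should give
the string '101.120' for the first variant and 'N/A' for the second. If
the position of the cell in the table is fixed, indexing parser.cells
directly would be even simpler than searching for the label.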
--
Dr. Sibylle Koczian
Universitaetsbibliothek, Abt. Naturwiss.
D-86135 Augsburg
e-mail : Sibylle.Koczian at Bibliothek.Uni-Augsburg.DE