HTML Code - Line Number

Prasad, Ramit ramit.prasad at jpmorgan.com
Fri Apr 27 14:21:57 EDT 2012


> Hello,
> 
> For scraping purposes, I am having a bit of trouble writing a block
> of code to define, and find, the relative position (line number) of a
> string of HTML code. I can pull out one string that I want, and then
> there is always a line of code, directly beneath the one I can pull
> out, that begins with the following:
> <td align="left" valign="top" class="body_cols_middle">
> 
> However, because this string of HTML code above is not unique to just
> the information I need (which I cannot currently pull out), I was
> hoping there is a way to effectively say "if you find the html string
> _____ in the line of HTML code above, and the string <td align="left"
> valign="top" class="body_cols_middle"> in the line immediately
> following, then pull everything that follows this second string."
> 
> Any thoughts as to how to define a function to do this, or do this
> some other way? All insight is much appreciated! Thanks.

You may have more long-term success in scraping by using an HTML parser like Beautiful Soup. 
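As a rough illustration of the parser approach, here is a minimal sketch. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without installing anything; Beautiful Soup would likely be shorter. The names AfterMarker and text_after_marker, the sample HTML, and the marker text "Price" are all made up for the example, and it assumes the string you can already find is visible text on the page rather than tag markup.

```python
from html.parser import HTMLParser

class AfterMarker(HTMLParser):
    """Capture the text of the first <td class="body_cols_middle"> cell
    that appears after some marker text has been seen."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False
        self.in_cell = False
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start capturing once the marker has been seen.
        if (self.result is None and self.seen_marker and not self.in_cell
                and tag == "td"
                and dict(attrs).get("class") == "body_cols_middle"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            self.result = "".join(self.parts).strip()

    def handle_data(self, data):
        if self.in_cell:
            self.parts.append(data)
        elif self.marker in data:
            self.seen_marker = True

def text_after_marker(html, marker):
    """Return the captured cell text, or None if nothing matched."""
    parser = AfterMarker(marker)
    parser.feed(html)
    parser.close()
    return parser.result

# Hypothetical sample page, not taken from the original poster's site.
sample = (
    '<tr><td class="body_cols_middle">Other</td>\n'
    '<td class="label">Price</td>\n'
    '<td align="left" valign="top" class="body_cols_middle">42.50</td></tr>'
)
print(text_after_marker(sample, "Price"))  # 42.50
```

Because the parser works on tags and text rather than on physical lines, this keeps working if the site reflows its HTML so that the two pieces no longer sit on adjacent lines.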

Alternatively, keep track of the current line and the previous line while
looping, and do something like the following.

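TD_TAG = '<td align="left" valign="top" class="body_cols_middle">'
FIRST_STRING = '_____'  # placeholder: the string you can already pull out
results = []
found = False
previous_line = ''
for line in html_lines:  # html_lines: the page source split into lines (assumed)
    if found:
        results.append(line)
        continue
    criteria1 = TD_TAG in line
    criteria2 = FIRST_STRING in previous_line
    if criteria1 and criteria2:
        found = True
        results.append(line.split(TD_TAG, 1)[1])  # rest of line after the tag
    previous_line = line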
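Since you asked about defining a function, the same previous-line check could be wrapped up roughly as follows. This is only a sketch: the function name text_after_pair, the sample lines, and the "Price" string are invented for illustration. Unlike the loop above, it does not collect every later line; it returns just the remainder of each line that matches, once per matching pair.

```python
def text_after_pair(lines, first_string, second_string):
    """Return the text following second_string on every line that comes
    directly after a line containing first_string."""
    matches = []
    previous_line = ""
    for line in lines:
        if first_string in previous_line and second_string in line:
            # Split once and keep whatever follows the second string.
            matches.append(line.split(second_string, 1)[1].strip())
        previous_line = line
    return matches

TD_TAG = '<td align="left" valign="top" class="body_cols_middle">'

# Hypothetical page source, already split into lines.
page = [
    '<td class="label">Price</td>',
    TD_TAG + '42.50</td>',
    '<td class="label">Volume</td>',
    TD_TAG + '1000</td>',
]
print(text_after_pair(page, "Price", TD_TAG))  # ['42.50</td>']
```

The trailing </td> is left in place here; stripping the remaining markup is a separate step, and is one of the things an HTML parser would handle for you.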

Ramit


