mangled attempt at using htmllib

Ari Davidow ari_deja at ivritype.com
Tue Oct 17 14:36:44 EDT 2000


Wow! I got sick for a few days and missed this very, very useful
tutorial. As it happens, my goals were slightly different than was
apparent from the code:

>> 200 OK        <a href="urlstatusgo.html?col=test&url= /
http%3A//www.foobar.com/archive/091400.html"> /
http://www.foobar.com/archive/091400.html</a>
>
>Well, I think your first misapprehension is that you appear to be
>expecting HTTP back from the urllib readlines() call, when in fact the
>HTTP is stripped off, and what *you* see is just the HTML!

I knew that this particular page would yield such lines. The idea was
to evaluate each such line and grab the URL between the anchor_bgn and
anchor_end; in the example shown, a simple

http://www.foobar.com/archive/091400.html

This might have been done more simply with a regular expression, e.g.,

   myUrl = re.search(r'<a href.*?>(.*?)</a>', line)
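A minimal sketch of that regular-expression approach, assuming the whole anchor sits on a single line. The sample string is modeled on the status output quoted above with the mail wrapping removed, and the names anchor_re and anchor_text are made up for illustration:

```python
import re

# Sample line modeled on the quoted status output, joined onto one line.
line = ('200 OK        <a href="urlstatusgo.html?col=test&url='
        'http%3A//www.foobar.com/archive/091400.html">'
        'http://www.foobar.com/archive/091400.html</a>')

# Non-greedy match: skip the attributes, then capture the text
# between the opening <a ...> tag and the closing </a>.
anchor_re = re.compile(r'<a href.*?>(.*?)</a>')

def anchor_text(text):
    """Return the text of the first anchor in text, or None if there is none."""
    match = anchor_re.search(text)
    return match.group(1) if match else None

print(anchor_text(line))
```

re.search returns a match object (or None), so the URL itself comes from group(1). An anchor that spans several lines would need the lines joined first, or the re.DOTALL flag.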

because, as I seem to be discovering, the "handle_data" stuff in my
parser class

>>    def handle_data(self, data):
>>            self.c_data=self.c_data+data

doesn't refer to the data inside the anchor tag, which is what I
wanted, but to something else. (Or my current modules aren't asking for
the right thing in the right way, because printing the contents of
self.c_data gives me "None" as a result.)
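For comparison, here is a sketch of collecting only the text inside anchor tags. It uses html.parser from current Python rather than htmllib (which was removed in Python 3), so it is not the original parser class; the class name AnchorTextParser and the in_anchor flag are assumptions for illustration. The point is that handle_data is called for all text in the document, so the parser has to track for itself whether it is inside an anchor:

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect the text found between <a ...> and </a> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False   # True while between <a> and </a>
        self.c_data = ''         # text of the anchor currently being read
        self.anchors = []        # text of every completed anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True
            self.c_data = ''

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_anchor:
            self.anchors.append(self.c_data)
            self.in_anchor = False

    def handle_data(self, data):
        # Called for ALL text; keep it only while inside an anchor.
        if self.in_anchor:
            self.c_data = self.c_data + data

parser = AnchorTextParser()
parser.feed('200 OK <a href="urlstatusgo.html?col=test">'
            'http://www.foobar.com/archive/091400.html</a> trailing text')
parser.close()
print(parser.anchors)
```

Initializing c_data in __init__ also avoids reading an attribute that was never set to a string.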

Anyway, just getting straight on the idiosyncrasies of htmllib, and being
reminded that cutting and pasting Python code almost ALWAYS requires
attention to whitespace (tabs convert oddly, and the interpreter on my
machine treats tabs and spaces as different, regardless of what they
look like), has moved me forward in very nice, useful ways.

Thank you!
ari

--
Ari Davidow
ari at ivritype.com




