HTMLParser error
jonbutler88 at googlemail.com
Thu May 22 15:06:18 EDT 2008
On May 22, 9:59 am, alex23 <wuwe... at gmail.com> wrote:
> On May 22, 6:22 pm, jonbutle... at googlemail.com wrote:
>
> > Still getting very odd errors though, this being the latest:
>
> > Traceback (most recent call last):
> >   File "spider.py", line 38, in <module>
> >   [...snip...]
> >     raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
> > httplib.InvalidURL: nonnumeric port: ''
>
> Okay. What I did was put some output in your Spider.parse method:
>
> def parse(self, page):
>     try:
>         print 'http://' + page
>         self.feed(urlopen('http://' + page).read())
>     except HTTPError:
>         print 'Error getting page source'
>
> And here's the output:
>
> >python spider.py
> What site would you like to scan?http://www.google.com
> http://www.google.com
> http://http://images.google.com.au/imghp?hl=en&tab=wi
>
> The links you're finding on each page already have the protocol
> specified. I'd remove the 'http://' addition from parse, and just add
> it to 'site' in the main section.
>
> if __name__ == '__main__':
>     s = Spider()
>     site = raw_input("What site would you like to scan? http://")
>     site = 'http://' + site
>     s.crawl(site)
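
For reference, a minimal sketch of another way to avoid the doubled prefix: add the protocol only when a URL does not already carry one. This is written for Python 3 (urllib.parse) rather than the Python 2 used in the script, and ensure_scheme is a hypothetical helper name, not something from spider.py. In 'http://http://images...' the host part comes out as 'http:', which ends in a colon with nothing after it, and that appears to be where the empty port in the traceback came from.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default="http"):
    """Return url unchanged if it already starts with http(s), else prefix it.

    Hypothetical helper for illustration; not part of the original spider.py.
    """
    url = url.strip()
    if urlsplit(url).scheme in ("http", "https"):
        return url
    return default + "://" + url


# A bare host name typed at the prompt gets the prefix added once.
print(ensure_scheme("www.google.com"))

# A link scraped from a page already has its protocol and is left alone.
print(ensure_scheme("http://images.google.com.au/imghp?hl=en&tab=wi"))

# The doubled prefix yields a host of 'http:' with an empty port after the colon.
print(repr(urlsplit("http://http://images.google.com.au/imghp").netloc))
```

With something like this, both the site typed at the prompt and the links found on each page could go through the same function before being opened.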
>
> > Also could you explain why I needed to add that
> > HTMLParser.__init__(self) line? Does it matter that I have overwritten
> > the __init__ function of spider?
>
> You haven't overwritten Spider.__init__. What you're doing every time
> you create a Spider object is first get HTMLParser to initialise it as
> it would any other HTMLParser object - which is what adds the .rawdata
> attribute to each HTMLParser instance - *and then* doing the Spider-
> specific initialisation you need.
>
> Here's an abbreviated copy of the actual HTMLParser class featuring
> only its __init__ and reset methods:
>
> class HTMLParser(markupbase.ParserBase):
>     def __init__(self):
>         """Initialize and reset this instance."""
>         self.reset()
>
>     def reset(self):
>         """Reset this instance. Loses all unprocessed data."""
>         self.rawdata = ''
>         self.lasttag = '???'
>         self.interesting = interesting_normal
>         markupbase.ParserBase.reset(self)
>
> When you initialise an instance of HTMLParser, it calls its reset
> method, which sets rawdata to an empty string, or adds it to the
> instance if it doesn't already exist. So when you call
> HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
> method on the Spider instance, which it inherits from HTMLParser...
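
To check my understanding, here is a small sketch of that behaviour. It assumes Python 3, where the class lives in html.parser and super().__init__() does the same job as HTMLParser.__init__(self). The links attribute and the BrokenSpider class are made up for illustration and are not from the original script.

```python
from html.parser import HTMLParser


class Spider(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise the instance first; its reset()
        # is what creates attributes such as rawdata.
        super().__init__()
        # Then do the Spider-specific setup (hypothetical attribute).
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag that has one.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class BrokenSpider(HTMLParser):
    def __init__(self):
        # The parent initialiser is deliberately skipped here,
        # so reset() never runs and rawdata is never created.
        self.links = []


s = Spider()
s.feed('<a href="http://example.com/">example</a>')
print(s.links)

try:
    BrokenSpider().feed("<p>hi</p>")
except AttributeError as exc:
    print("without the parent __init__:", exc)
```

The second class fails as soon as feed() touches rawdata, which matches the kind of error that adding the HTMLParser.__init__(self) line made go away.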
>
> Are you familiar with object oriented design at all? If you're not,
> let me know and I'll track down some decent intro docs. Inheritance is
> a pretty fundamental concept but I don't think I'm doing it justice.
Nope, this is my first experience with object-oriented programming.
I've only been learning Python for a few weeks, but it seemed simple
enough to inspire me to be a bit ambitious. If you could hook me up with
some good docs, that would be great. I was about to buy a book on Python,
specifically an OO-based one, but I'll look at these docs first. I
understand most of the concepts of inheritance; I've just never used
them before.
Thanks