Why does this fail?
dlmurray at micro-net.com
Mon Jan 5 03:58:18 CET 2004
Thank you all, this is a hell of a newsgroup. The diversity of answers
helped me with some unasked questions, and provided more elegant solutions
to what I thought I had figured out on my own. I appreciate it.
It's part of a spider that I'm working on to verify my own (and friends') web
pages and check them for broken links. Looks like making it follow robot rules
(robots.txt and meta field exclusions) is what's left.
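
For the robots.txt half of that, the standard library already has a parser.
The sketch below is only an illustration, written for Python 3, where the
module is urllib.robotparser. The helper name build_rules, the user agent
"my-spider", the sample rules, and the example.com URLs are all made up for
the example. It parses the rules from a string so that it runs without any
network access.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt_text):
    """Build a rule checker from the text of a robots.txt file.

    build_rules is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so split the text first.
    parser.parse(robots_txt_text.splitlines())
    return parser


# Made-up rules for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

rules = build_rules(SAMPLE)

# "my-spider" is an assumed user-agent name for the spider.
allowed = rules.can_fetch("my-spider", "http://example.com/index.html")
blocked = rules.can_fetch("my-spider", "http://example.com/private/notes.html")

# An empty rule set (for example, no usable robots.txt) allows everything.
no_rules = build_rules("")
open_site = no_rules.can_fetch("my-spider", "http://example.com/anything.html")
```

Against a live site one would normally call set_url() and read() on the
parser instead of parse(). The meta field exclusions are not covered by
this module and would have to be picked up while parsing each page.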
I have found the library for HTML/SGML to be not very robust. Big .php and
.html files with lots of cascades and external references break it very
ungracefully (sgmllib.SGMLParseError: expected name token). I'd like to be
able to trap that stuff and just move on to the next file, accepting the
error. I'm reading in the external links and printing the title as a sanity
check in addition to collecting href anchors. The problem that I asked
about reared its head when I started testing for a robots.txt file, which
may or may not exist.
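
One way to trap a parse failure and move on to the next file is to wrap the
parse of each page in its own try/except. The sketch below is a rough
illustration, not the spider's actual code. It assumes Python 3, where
sgmllib no longer exists, so it uses html.parser.HTMLParser as a stand-in.
That parser is far more tolerant of bad markup and rarely raises. The names
LinkCollector and scan_pages are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_pages(pages):
    """Parse each page; record failures instead of stopping.

    pages maps a page name to its HTML text.
    Returns (results, failures).
    """
    results, failures = {}, {}
    for name, html in pages.items():
        collector = LinkCollector()
        try:
            collector.feed(html)
            collector.close()
        except Exception as exc:
            # Deliberately broad: note the error and go to the next page.
            failures[name] = repr(exc)
            continue
        results[name] = (collector.title.strip(), collector.links)
    return results, failures
```

With sgmllib itself the same shape applies: put the feed() call inside the
try and catch sgmllib.SGMLParseError, so that one bad page costs a log entry
rather than the whole run.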
The real point is to learn the language. When a new grad wrote a useful
utility at work in Python faster than I could have written it in C, I decided
that I needed to learn Python. He's very sharp, but he sold me on the
language too. Since I often have to write utilities, and normally don't have
much time to spend on them, Python seems to be a very good fit.