Create an index from a webpage [RANT, DNFTT]
duncan.booth at invalid.invalid
Fri Sep 9 12:29:47 CEST 2011
Simon Cropper <simoncropper at fossworkflowguides.com> wrote:
> Certainly doable but
> considering the sheer commonality of this task I don't understand why a
> simple script does not already exist - hence my original request for
I think you may have underestimated the complexity of the task in general.
To do it for a remote website you need to specify what you consider to be a
unique page. Here are some questions:
Is case significant for URLs (technically it is for the path and query,
though not for the scheme or host, but IIS sites tend to ignore it and to
contain links with random permutations of case)?
Are there any query parameters that make two pages distinct? Or any
parameters that you should ignore? Is the order of parameters significant?
I recently came across a site that not only had multiple links to identical
pages with the query parameters in different order but also used a
non-standard % to separate parameters instead of &: it's not so easy getting
crawlers to handle that mess.
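
To make those questions concrete, here is a rough sketch of the sort of
normalisation you would need before two URLs can be compared. It is only an
illustration: the function name canonical_url and its parameters
(ignore_params, fold_case, separator) are my own inventions, and the right
answers for each of them depend entirely on the site in question.

```python
from urllib.parse import urlsplit, urlunsplit, unquote_plus, urlencode


def canonical_url(url, ignore_params=(), fold_case=False, separator="&"):
    """Return a normalised form of url for duplicate detection.

    All the policy knobs here are assumptions that must be set per site:
    ignore_params -- query parameter names that do not make a page distinct
    fold_case     -- treat the path as case-insensitive (IIS-style sites)
    separator     -- the character the site uses between query parameters
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if fold_case:
        path = path.lower()

    # Split the query by hand so that an odd separator such as "%" works.
    pairs = []
    for field in parts.query.split(separator):
        if not field:
            continue
        key, _, value = field.partition("=")
        pairs.append((unquote_plus(key), unquote_plus(value)))

    # Drop ignorable parameters and sort, so parameter order is not significant.
    pairs = sorted((k, v) for k, v in pairs if k not in ignore_params)

    # Scheme and network location are lower-cased; the fragment is dropped.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(pairs), "")
    )
```

A crawler would keep a set of these canonical forms and skip any link whose
canonical form it has already seen. Note that a site which uses % as a
separator and also percent-encodes its values cannot be parsed reliably this
way, which is rather the point.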
Even after ignoring query parameters, is there a finite number of pages to
crawl?
For example, Apache has a spelling correction module that can effectively
allow any number of spurious subfolders: I've seen a site where
"/folder1/index.html" had a link to "folder2/index.html" and
"/folder2/index.html" linked to "folder1/index.html". Apache helpfully
accepted /folder2/folder1/ as equivalent to /folder1/ and therefore by
extension also accepted /folder2/folder1/folder2/folder1/...
Zope is also good at creating infinite folder structures.
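
One crude defence against that kind of loop is to refuse to follow URLs whose
paths are suspiciously deep or repeat a segment. The sketch below is a
heuristic of my own, not something taken from any particular crawler; the name
looks_like_path_loop and the thresholds max_depth and max_repeats are
assumptions you would have to tune, and a site that legitimately repeats
folder names would be wrongly rejected.

```python
from urllib.parse import urlsplit


def looks_like_path_loop(url, max_depth=8, max_repeats=1):
    """Guess whether url is part of an endlessly nested folder structure.

    Returns True if the path has more than max_depth segments, or if any
    single segment occurs more than max_repeats times. Both limits are
    arbitrary illustrative defaults.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))
```

With the defaults this rejects /folder2/folder1/folder2/folder1/index.html
but still lets /folder2/folder1/index.html through, so it limits the damage
rather than identifying which pages are really duplicates.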
If you want to spider a remote site then there are plenty of off-the-shelf
spidering packages, e.g. httrack. They have a lot of configuration options
to try to handle the above gotchas.
Your case is probably a lot simpler, but that's just a few reasons why it
isn't actually a trivial task. Building a list by scanning a bunch of
folders with html files is comparatively easy, which is why that is almost
always the preferred solution if possible.
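
For comparison, the local-folder version can be sketched with nothing but the
standard library. This assumes the files end in .html or .htm, are more or
less UTF-8, and that the <title> element is what you want in the index; the
names TitleParser and build_index are just for illustration.

```python
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside a document's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def build_index(root):
    """Return a sorted list of (relative path, title) for HTML files under root.

    Files without a title fall back to their file name.
    """
    index = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            parser = TitleParser()
            # Undecodable bytes are replaced rather than aborting the scan.
            with open(path, encoding="utf-8", errors="replace") as f:
                parser.feed(f.read())
            title = parser.title.strip() or name
            index.append((os.path.relpath(path, root), title))
    return sorted(index)
```

Each file on disk is exactly one page, so none of the questions about case,
query parameters or infinite folders arise.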
Duncan Booth http://kupuguy.blogspot.com