Create an index from a webpage [RANT, DNFTT]

Duncan Booth duncan.booth at invalid.invalid
Fri Sep 9 06:29:47 EDT 2011


Simon Cropper <simoncropper at fossworkflowguides.com> wrote:

> Certainly doable but 
> considering the sheer commonality of this task I don't understand why a 
> simple script does not already exist - hence my original request for 
> assistance.

I think you may have underestimated the complexity of the task in general.

To do it for a remote website you need to specify what you consider to be a 
unique page. Here are some questions:

Is case significant for URLs (technically it always is, but IIS sites tend 
to ignore it and to contain links with random permutations of case)?

Are there any query parameters that make two pages distinct? Or any 
parameters that you should ignore? Is the order of parameters significant? 
I recently came across a site that not only had multiple links to identical 
pages with the query parameters in a different order but also used a 
non-standard % to separate parameters instead of &: it's not so easy getting 
crawlers to handle that mess.
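
As a rough illustration of the decisions above, here is a minimal sketch of a
URL normaliser using only the Python 3 standard library. The function name
canonical_url and its two options are my own invention for this example, not
part of any crawler; which parameters to ignore and whether to fold case are
exactly the per-site choices you would have to make yourself. It assumes the
standard & separator, so a site using % between parameters would need its own
splitting step first.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_url(url, ignore_params=(), case_sensitive_path=True):
    """Return a normalised form of url for 'have I seen this page?' checks.

    ignore_params and case_sensitive_path are hypothetical knobs for this
    sketch; the right settings depend on the site being crawled.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if not case_sensitive_path:
        # Only sensible for servers that ignore case (e.g. many IIS sites).
        path = path.lower()
    # Drop parameters that do not change the page, then sort the rest so
    # that ?a=1&b=2 and ?b=2&a=1 compare equal.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignore_params
    ]
    query = urlencode(sorted(params))
    # Scheme and host are case-insensitive; the fragment is never sent to
    # the server, so it is discarded.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, query, "")
    )
```

Two links then count as the same page when their canonical forms are equal,
so a crawler would keep a set of canonical_url(...) results rather than raw
URLs.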

Even after ignoring query parameters, is there a finite number of pages on 
the site?
For example, Apache has a spelling correction module that can effectively 
allow any number of spurious subfolders: I've seen a site where 
"/folder1/index.html" had a link to "folder2/index.html" and 
"/folder2/index.html" linked to "folder1/index.html". Apache helpfully 
accepted /folder2/folder1/ as equivalent to /folder1/ and therefore by 
extension also accepted /folder2/folder1/folder2/folder1/...
Zope is also good at creating infinite folder structures.
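
One crude way to avoid falling into that kind of trap is to refuse URLs whose
paths are suspiciously deep or repetitive. The sketch below is only a
heuristic under assumed thresholds: the name looks_like_path_loop and the
defaults max_depth=10 and max_repeats=2 are made up for illustration, and a
site with legitimately repeated folder names would be wrongly rejected.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_path_loop(url, max_depth=10, max_repeats=2):
    """Guess whether url is part of an endlessly nesting folder structure.

    Both thresholds are arbitrary assumptions for this sketch.
    """
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    if len(segments) > max_depth:
        return True
    # The same folder name appearing many times in one path is the
    # signature of the /folder2/folder1/folder2/folder1/... pattern.
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

A crawler would call this before queueing a link and skip anything for which
it returns True.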

If you want to spider a remote site then there are plenty of off-the-shelf 
spidering packages, e.g. httrack. They have a lot of configuration options 
to try to handle the above gotchas.

Your case is probably a lot simpler, but those are just a few reasons why it 
isn't actually a trivial task. Building a list by scanning a bunch of 
folders of HTML files is comparatively easy, which is why that is almost 
always the preferred solution when it is possible.
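
For the local-folder case, something along these lines may be all that is
needed. It is a sketch, not a finished tool: it assumes Python 3, that the
files end in .html or .htm, and that they are readable as UTF-8 (undecodable
bytes are replaced). The names TitleParser and build_index are mine.

```python
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def build_index(root):
    """Return a sorted list of (relative_path, title) for HTML files under root."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            parser = TitleParser()
            with open(path, encoding="utf-8", errors="replace") as handle:
                parser.feed(handle.read())
            # Use forward slashes so the paths can double as relative links.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            # Fall back to the file name when the page has no title.
            entries.append((rel, parser.title.strip() or name))
    return sorted(entries)
```

Each (path, title) pair can then be written out as a link in whatever index
format you want; because the file system is finite and each file has exactly
one path, none of the questions above arise.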

-- 
Duncan Booth http://kupuguy.blogspot.com


