python fast HTML data extraction library
pinkeen at gmail.com
Thu Jul 23 00:43:08 CEST 2009
Some time ago I was searching for a library that would simplify mass
data scraping/extraction from web pages. A Python XPath implementation
seemed like the way to go. The problem was that most of the HTML on
the net doesn't conform to XML standards, not even the XHTML pages
(including those advertised as valid XHTML).
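
To illustrate the kind of failure meant here, a minimal sketch using
the standard library's xml.etree.ElementTree (my choice for the
example, not necessarily the XPath implementation used at the time).
The helper name is_well_formed_xml is made up for illustration. A
single unclosed <br>, which browsers accept without complaint, is
enough to make a strict XML parser reject the markup:

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if markup parses as XML, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml("<p>Hello<br/>world</p>"))  # True
print(is_well_formed_xml("<p>Hello<br>world</p>"))   # False: <br> is never closed
```
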
I tried to fix that with BeautifulSoup plus regexp filtering of some
particular cases I encountered. That was slow, and after running my
data scraper for some time a lot of new problems (exceptions from the
XPath parser) kept showing up. Not to mention that BeautifulSoup
stripped almost all of the content from some heavily broken pages
(a 50+ KiB page stripped down to a few hundred bytes). Character
encoding conversion was hell too - even UTF-8 pages had some
non-standard characters causing issues.
Cutting to the chase - that's when I decided to take the matter into
my own hands. Overnight I hacked together a solution with a completely
new approach. It's called htxpath - a small, lightweight Python
library with no dependencies which lets you extract specific tag(s)
from an HTML document using a path string whose syntax is very similar
to XPath (but more convenient in some cases).
It did a very good job for me.
My library, rather than parsing the whole input into a tree, processes
it as a character stream using regular expressions.
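
The general idea can be sketched roughly as follows. This is not
htxpath's code or API, only a small illustration of tree-free
extraction under my own assumptions: the function name extract_tags,
the TAG_RE pattern and the depth-counter strategy are all made up for
the example. It scans tags from left to right with one regular
expression and tracks only the nesting depth of the tag being looked
for, so unbalanced markup elsewhere in the page does not matter:

```python
import re

# Matches any opening or closing tag. Group 1 is "/" for a closing tag,
# group 2 is the tag name, group 3 is the raw attribute text.
# Simplification: a ">" inside a quoted attribute value is not handled.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def extract_tags(html, name):
    """Yield the inner markup of every outermost <name>...</name> element.

    Hypothetical helper, not part of htxpath. Only a depth counter is
    kept while scanning, so no document tree is ever built.
    """
    name = name.lower()
    depth = 0
    start = None
    for match in TAG_RE.finditer(html):
        closing, tag, attrs = match.group(1), match.group(2).lower(), match.group(3)
        if tag != name:
            continue  # other tags, balanced or not, are simply skipped
        if not closing:
            if attrs.rstrip().endswith("/"):
                continue  # self-closing form such as <name/>
            if depth == 0:
                start = match.end()  # content begins right after the tag
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                yield html[start:match.start()]


broken = "<body><p>first<br><b>unclosed</p><div><p>second</p></div>"
print(list(extract_tags(broken, "p")))
# ['first<br><b>unclosed', 'second']
```

The stray <br> and the unclosed <b> and <body> tags would trip a
strict parser, but here they are just text that gets passed over.
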
I decided to share it with everyone, so here it is: http://code.google.com/p/htxpath/
I am aware that it is not beautifully coded, as my experience with
Python is rather brief, but I am curious whether it will be useful to
anyone (it's also my first potentially [real-world ;)] useful project
gone public). If it is, I promise to continue developing it. It's
probably full of bugs, but I can't catch them all by myself.