sgmllib.py
Stefan Behnel
stefan_ml at behnel.de
Mon Aug 24 03:08:07 EDT 2009
elsa wrote:
> I'm new to both this forum and Python, and I've got a bit stuck trying
> to learn how to parse HTML...
If what you want to do is *parse* the HTML instead of trying to *learn* how
to parse it, you might want to give the existing (external) HTML parser
libraries a try. There's lxml.html (extremely fast and fixes up broken
HTML), html5lib (very slow, but very browser-like parse results) and
BeautifulSoup (slow, but good encoding detection if you need that).
Here are a couple of (only slightly biased) comparisons:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
> python sgmllib.py "path/to/my/file.html" .... example (1)
>
> this doesn't work for me. I think I have figured out the problem -
> the error says
>
> "/System/Library/Frameworks/Python.framework/Versions/2.5/Resources/
> Python.app/Contents/MacOS/Python: can't open file 'sgmllib.py': [Errno
> 2] No such file or directory"
>
> the problem is that this path is wrong. My sgmllib.py is in:
>
> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/
> python2.5/sgmllib.py"
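You can use "python -m sgmllib" to run a module from the stdlib as a script
(or, to be more accurate, any module on the module search path, which
includes PYTHONPATH). That works regardless of your current directory,
whereas "python sgmllib.py" only looks for the file where you are.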
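To see which file Python would actually pick up for a module name, here is a small sketch. The helper name locate_module is made up for illustration, and the code assumes Python 3, where sgmllib no longer exists (it was removed in 3.0), so html.parser stands in as a module that is still in the stdlib.

```python
import importlib.util


def locate_module(name):
    """Return the path Python would load for `name`, or None if not found.

    Hypothetical helper for illustration: it searches the module search
    path (sys.path), the same path the import system uses.
    """
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# On Python 2.5 the equivalent lookup for sgmllib would point into
# .../lib/python2.5/; on Python 3 sgmllib is gone from the stdlib.
print(locate_module("html.parser"))
print(locate_module("no_such_module_xyz"))
```

The point is only that the lookup goes through the search path, not the current directory, which is why the file does not have to be next to you.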
But note that sgmllib is a particularly cumbersome way to deal with HTML.
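To give an idea of the callback style involved, here is a rough sketch that collects link targets. It does not use sgmllib itself (which is unavailable on Python 3) but html.parser, a stdlib module with a similar subclass-and-override design; the class name LinkCollector, the function collect_links and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Feed a string of HTML to the parser and return the links found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<p>See <a href="http://example.com/">this</a> and <a href="/local">that</a>.</p>'
print(collect_links(sample))
```

You have to keep track of all state yourself in the handler methods, which is the part that gets tedious quickly on real-world pages.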
Stefan