HTML parsing confusion

Diez B. Roggisch deets at nospam.web.de
Tue Jan 22 17:39:45 CET 2008


Alnilam wrote:

> On Jan 22, 8:44 am, Alnilam <alni... at gmail.com> wrote:
>> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
>> > 200-modules PyXML package installed. And you don't want the 75Kb
>> > BeautifulSoup?
>>
>> I wasn't aware that I had PyXML installed, and can't find a reference
>> to having it installed in pydocs. ...
> 
> Ugh. Found it. Sorry about that, but I still don't understand why
> there isn't a simple way to do this without using PyXML, BeautifulSoup
> or libxml2dom. What's the point in having sgmllib, htmllib,
> HTMLParser, and formatter all built in if I have to use someone
> else's modules to write a couple of lines of code that achieve the
> simple thing I want. I get the feeling that this would be easier if I
> just broke down and wrote a couple of regular expressions, but it
> hardly seems a 'pythonic' way of going about things.

This is simply a gross misunderstanding of what BeautifulSoup or lxml
accomplish. Dealing with malformed HTML while still making _some_ sense of
it is by no means trivial. And just because you can come up with a few
lines of regex code that work for your current use case doesn't mean they
serve as a general HTML-fixing routine. Or do you think the rather long
history and 75Kb of code behind BS exist because its creator wasn't aware
of regexes?
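
To make that concrete, here is a minimal sketch in present-day Python 3,
using the standard-library html.parser module (the successor to the
HTMLParser module mentioned above). The sample markup and the names
SAMPLE, NAIVE_HREF, LinkCollector, regex_links and parser_links are made
up for this illustration. Note that html.parser only tokenizes leniently;
it does not repair the document structure the way BeautifulSoup or lxml
try to, so this shows just the first step of the problem.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: mixed quoting, upper-case tag and attribute
# names, an unquoted attribute value, and tags that are never closed.
SAMPLE = """<p>See <a href="one.html">one</a>,
<A HREF='two.html'>two</A> and <a class=x href=three.html>three
<p>Unclosed paragraph"""

# The "couple of lines" approach: fine for the first link, blind to the rest.
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names and strips quotes
        # before calling this hook.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def regex_links(html):
    """Return the hrefs matched by the naive regular expression."""
    return NAIVE_HREF.findall(html)


def parser_links(html):
    """Return the hrefs found by the tokenizing parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(regex_links(SAMPLE))
    print(parser_links(SAMPLE))
```

On this sample the regex finds only one.html, while the parser reports all
three links. Every variation you patch into the regex by hand (case,
quoting, attribute order) is one the parser already handles.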

And it also makes no sense to stuff everything remotely useful into the
standard lib. That would force those libraries to align their development
and release cycles with Python's, resulting in far fewer features and much
less stability than one would wish.

And to be honest: I fail to see where your problem is. BeautifulSoup is a
single Python file. So whatever you carry with you from machine to machine,
if it's capable of holding a file of your own code, you can simply put
BeautifulSoup beside it - even if it is a floppy disk.

Diez


