What's the best way to write this regular expression?

Paul Rubin no.email at nospam.invalid
Wed Mar 7 05:36:02 EST 2012


John Salerno <johnjsal at gmail.com> writes:
> The Beautiful Soup 4 documentation was very clear, and BS4 itself is
> so simple and Pythonic. And best of all, since version 4 no longer
> does the parsing itself, you can choose your own parser, and it works
> with lxml, so I'll still be using lxml, but with a nice, clean overlay
> for navigating the tree structure.

I haven't used BS4 but have made good use of earlier versions.

The main thing to understand is that an awful lot of HTML in the real
world is malformed and will break an XML parser or anything else that
expects syntactically valid HTML.  People tend to write HTML that works
well enough to render decently in browsers, whose parsers therefore
have to be tolerant of errors.  Beautiful Soup also tries to make sense
of crappy, malformed HTML.  Partly as a result, it's dog slow compared
to any serious XML parser.  But it works very well if you don't mind
the low speed.
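
To illustrate the difference, here is a rough sketch using only the
standard library.  The markup in BAD_HTML and the helper names are made
up for the example, and html.parser stands in for a tolerant parser
here; it is not Beautiful Soup itself, just one of the parsers BS4 can
sit on top of.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed page: unclosed <p> and <b>, unquoted attribute,
# no closing </a>.  Browsers render this sort of thing without complaint.
BAD_HTML = ("<html><body><p>First<p>Second <b>bold"
            "<a href=x.html>link</body></html>")


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Tolerant parser that just records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")


def tolerant_links(markup):
    """Collect links without caring whether the markup is well formed."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("strict XML parse succeeded:", strict_parse_ok(BAD_HTML))
    print("links found by tolerant parser:", tolerant_links(BAD_HTML))
```

The strict parser gives up on the first thing that isn't well formed,
while the tolerant one keeps going and still finds the link.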


