[Tutor] Extracting data from HTML files

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Dec 29 22:38:50 CET 2005


> >>> import re
> >>> file = open("file1.html")
> >>> data = file.read()
> >>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
>
> # I searched around the docs on regexes I have and found that the "r"
> # after the re.compile(' will detect repeating words.

Hi Oswaldo,

Actually, no.  What you're seeing is a "raw" string literal.  See:

http://www.amk.ca/python/howto/regex/regex.html#SECTION000420000000000000000

for more details about this.  The idea is that we often want to make
strings where backslashes are just literally backslashes, rather than
treated by Python as escape characters.
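
To make the difference concrete, here is a small sketch (Python 3 syntax; the sample text "a cat sat" and the variable names are made up for illustration).  In an ordinary string literal Python turns "\b" into a single backspace character before the regex engine ever sees it; in a raw literal the backslash survives, and the regex engine reads \b as a word boundary:

```python
import re

# Ordinary literal: Python converts each "\b" to one backspace character.
normal = "\bcat\b"

# Raw literal: the backslashes are kept, so the regex engine receives \b,
# which it interprets as a word boundary.
raw = r"\bcat\b"

print(len(normal), len(raw))                 # 5 7
print(re.search(normal, "a cat sat"))        # None: no backspaces in the text
print(re.search(raw, "a cat sat").group())   # cat
```

Note that the "r" changes nothing about how the pattern treats repeated words; it only affects how Python reads the string literal.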

The Regular Expression HOWTO itself is pretty good and talks about some of
the stuff you've been running into, so here's a link to the base url that
you may want to look at:

    http://www.amk.ca/python/howto/regex/


> I want to read the whole string even if it has repeating words.  #Also,
> I dont understand the actual regex (.*?) . If I want to match
> #everything inside </strong> and <br><strong> , shouldn`t I just put a
> "*" #?

You're confusing the "globbing" notation used in Unix shells with the
miniature pattern language used in regular expressions.  They both use
similar symbols, but with totally different interpretations.  Be aware of
this context, as it's easy to get confused because of their surface
similarities.

For example,

    "ab*"

under a globbing interpretation means:

    'a' and 'b', followed by any number of characters.

But under a regular expression interpretation, this means:

    'a', followed by any number of 'b's.


As a followup: to express the idea "'a' and 'b', followed by any number
of characters" as a regular expression pattern, we'd write:

    "ab.*"

So any globbing pattern can be translated fairly easily to a regular
expression pattern.  However, going the other way doesn't usually work: it's
often not possible to take an arbitrary regular expression, like "ab*",
and make it work as a glob.  So regular expressions are more expressive
than globs, but with that power comes great resp... err, I mean, more
complexity.  *grin*
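
If you want to see the two interpretations side by side, here is a rough sketch using the standard library's fnmatch module for the glob reading and re for the regex reading (Python 3 syntax; the list of names is invented for the example):

```python
import fnmatch
import re

names = ["a", "ab", "abb", "abc", "abxyz"]

# Glob reading of "ab*": 'a' and 'b', then any number of characters.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "ab*")]

# Regex reading of "ab*": 'a', then any number of 'b's.
regex_hits = [n for n in names if re.fullmatch(r"ab*", n)]

# The glob idea written by hand as a regex: "ab.*"
translated_hits = [n for n in names if re.fullmatch(r"ab.*", n)]

# fnmatch.translate() does the glob-to-regex conversion for us.
auto_hits = [n for n in names if re.match(fnmatch.translate("ab*"), n)]

print(glob_hits)        # ['ab', 'abb', 'abc', 'abxyz']
print(regex_hits)       # ['a', 'ab', 'abb']
print(translated_hits)  # ['ab', 'abb', 'abc', 'abxyz']
print(auto_hits)        # ['ab', 'abb', 'abc', 'abxyz']
```

The same three characters select different sets of names depending on which pattern language is reading them.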


> I also found that on some of the strings I want to extract, when python
> reads them using file.read(), there are newline characters and other
> stuff that doesn`t show up in the actual html source.

Not certain that I understand what you mean there.  Can you show us?
read() should not adulterate the byte stream that comes out of your files.


> Do I have to take these in to account in the regex or will it
> automatically include them?

Newlines are, by default, handled differently than other characters: the
'.' regular expression metacharacter matches any character except a
newline.  You can add the 're.DOTALL' flag so that newlines are also
matched by '.'; see the Regex HOWTO above to see how this might work.
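
Here is a minimal sketch of the difference, reusing the pattern from your session (Python 3 syntax).  The sample HTML string is made up; I'm assuming your real file has a line break somewhere inside the title:

```python
import re

# Hypothetical sample: the title text is split across two lines.
html = "<strong>Title:</strong> A Tale of\nTwo Cities<br><strong>"
pattern = r"<strong>Title:</strong>(.*?)<br><strong>"

# Without flags, '.' refuses to cross the newline, so nothing matches.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' matches the newline too, so the group spans both lines.
with_flag = re.search(pattern, html, re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # ' A Tale of\nTwo Cities'
```

The flag can also be given to re.compile() as a second argument if you prefer to keep a compiled pattern around.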

As an aside: the problems you're running into are very much why we
encourage folks not to process HTML with regular expressions: REs also
come with their own somewhat-high learning curve.
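
For comparison, here is one possible sketch of the parser-based route, using the html.parser module from the Python 3 standard library.  Treat it as an illustration rather than a recommendation of a specific tool: the StrongLabelParser class, the sample HTML, and the assumption that each field looks like <strong>Label:</strong> followed by its value are all mine, not taken from your file:

```python
from html.parser import HTMLParser


class StrongLabelParser(HTMLParser):
    """Collect the text that follows each <strong>Label:</strong> pair."""

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.label = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.label = None  # a new label is about to start

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            # Text inside <strong> is the label; drop the trailing colon.
            self.label = data.strip().rstrip(":")
        elif self.label is not None:
            # Text outside <strong> belongs to the most recent label.
            self.fields[self.label] = self.fields.get(self.label, "") + data


# Hypothetical sample input, with a newline inside the title.
sample = (
    "<strong>Title:</strong> A Tale of\nTwo Cities<br>"
    "<strong>Author:</strong> Charles Dickens<br>"
)

parser = StrongLabelParser()
parser.feed(sample)
parser.close()

fields = {label: text.strip() for label, text in parser.fields.items()}
print(fields)
```

The newline inside the title needs no special handling here, since the parser hands over the text between tags as-is.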


Good luck to you.


