[Tutor] Extracting data from HTML files

Oswaldo Martinez motorolaguy at gmx.net
Thu Dec 29 21:20:24 CET 2005


OK, before getting into the loop in the script I decided to try with one
file first. I have some doubts about parts of the script, plus I got an
error:

>>> import re
>>> file = open("file1.html")
>>> data = file.read()
>>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')

# I searched around the regex docs I have and found that the "r" right after
# re.compile( will detect repeating words. Why is this useful in my case?
# I want to read the whole string even if it has repeating words. Also, I
# don't understand the actual regex (.*?). If I want to match everything
# between </strong> and <br><strong>, shouldn't I just put a "*"?
# I tried that and it gave me an error, of course.
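
For reference, a minimal sketch of what the two pieces asked about do. It
uses a made-up sample string, since the real file1.html is not shown. The r
prefix only marks a raw string literal (backslashes stay as typed), and
(.*?) is a non-greedy group that captures as little as possible before the
next <br><strong>. The (*) pattern is only a guess at what "just a *" looked
like; a * with nothing before it to repeat does not compile.

```python
import re

# Hypothetical sample text; not taken from the real HTML file.
sample = ('<strong>Title:</strong> First <br><strong>Author:</strong>'
          ' X <br><strong>')

# r'...' is a raw string literal: the backslash is kept, not treated as an escape.
raw_len = len(r'\n')      # backslash + 'n'
escaped_len = len('\n')   # a single newline character

# Non-greedy: stop at the first following <br><strong>.
non_greedy = re.search(r'<strong>Title:</strong>(.*?)<br><strong>', sample)
# Greedy: run on to the last <br><strong> in the text.
greedy = re.search(r'<strong>Title:</strong>(.*)<br><strong>', sample)

print(repr(non_greedy.group(1)))
print(repr(greedy.group(1)))

# A group holding only * has nothing to repeat, so compiling fails.
try:
    re.compile(r'<strong>Title:</strong>(*)<br><strong>')
    star_error = None
except re.error as exc:
    star_error = str(exc)
print(star_error)
```

With several fields in one file, the greedy form would swallow everything up
to the last field, which is why the non-greedy form fits here.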

>>> m = catRe.search(data)
>>> category = m.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
>>>

I also found that in some of the strings I want to extract, when Python
reads them using file.read(), there are newline characters and other stuff
that doesn't show up in the actual HTML source. Do I have to take these into
account in the regex, or will it automatically include them?
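
On the newline question, a hedged sketch with an invented snippet: by
default "." does not match a newline, so a title that is split across lines
makes search() return None. That is one possible cause of the AttributeError
above, though the real cause cannot be confirmed without the file. Passing
re.DOTALL lets "." cross newlines, and testing the result for None avoids
the traceback.

```python
import re

# Invented sample with a newline inside the title; not the real file contents.
data = '<strong>Title:</strong> Some\nlong title<br><strong>Author:</strong>'

pattern = r'<strong>Title:</strong>(.*?)<br><strong>'

# Without DOTALL, "." stops at the newline and the search fails.
plain_match = re.compile(pattern).search(data)
print(plain_match)

# With DOTALL, "." also matches newline characters.
m = re.compile(pattern, re.DOTALL).search(data)
if m is None:
    category = None  # nothing found; handle this case instead of calling .group()
else:
    # Collapse newlines and runs of spaces into single spaces.
    category = ' '.join(m.group(1).split())
print(category)
```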




> --- Original Message ---
> From: Kent Johnson <kent37 at tds.net>
> To: Python Tutor <tutor at python.org>
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Thu, 29 Dec 2005 14:18:38 -0500
> 
> Try something like this:
> 
> def process(data):
>    # this is a function you define to process the data from one file
>    pass  # placeholder so the definition is valid until you fill it in
> 
> maxFileIndex = ... # whatever the max count is
> for i in range(1, maxFileIndex+1):  # i will take on each value
>                                      # from 1 to maxFileIndex
>    name = 'article%s.html' % i  # make a file name
>    f = open(name)  # open the file and read its contents
>    data = f.read()
>    f.close()
>    process(data)
> 
> Kent
> 
> PS Please reply to the list
