[Tutor] Extracting data from HTML files
Oswaldo Martinez
motorolaguy at gmx.net
Thu Dec 29 21:20:24 CET 2005
OK, before I get into the loop in the script, I decided to try it first with one
file. I have some doubts about some parts of the script, plus I got an
error:
>>> import re
>>> file = open("file1.html")
>>> data = file.read()
>>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
# I searched around the regex docs I have and found that the "r" after
# re.compile(' will detect repeating words. Why is this useful in my case?
# I want to read the whole string even if it has repeating words. Also, I
# don't understand the actual regex (.*?). If I want to match everything
# between </strong> and <br><strong>, shouldn't I just put a "*"?
# I tried that and it gave me an error, of course.
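To show what I mean, here is a tiny test with made-up HTML comparing "*" and "*?" (as far as I can tell, the r'' prefix just makes the string a raw string so backslashes pass through unchanged; it has nothing to do with repeating words):

```python
import re

# made-up sample with two fields, just for illustration
html = '<strong>Title:</strong>Foo<br><strong>Author:</strong>Bar<br><strong>'

greedy = re.search(r'<strong>Title:</strong>(.*)<br><strong>', html)
lazy = re.search(r'<strong>Title:</strong>(.*?)<br><strong>', html)

# "*" is greedy: it runs to the LAST '<br><strong>'
print(greedy.group(1))  # Foo<br><strong>Author:</strong>Bar
# "*?" is non-greedy: it stops at the FIRST '<br><strong>'
print(lazy.group(1))    # Foo
```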
>>> m = catRe.search(data)
>>> category = m.group(1)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
>>>
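From what I can tell, search() returns None when the pattern isn't found anywhere in the string, and calling .group(1) on None gives exactly this AttributeError. A small guard avoids the traceback (the sample data here is invented):

```python
import re

catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')

# invented data that does NOT contain the pattern
data = '<p>no title markup in this file</p>'

m = catRe.search(data)
if m is not None:
    category = m.group(1)
else:
    category = None  # pattern not found; decide how to handle this case
print(category)  # None
```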
I also found that in some of the strings I want to extract, when Python
reads them using file.read(), there are newline characters and other stuff
that don't show up in the actual HTML source. Do I have to take these into
account in the regex, or will it include them automatically?
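As far as I understand, '.' does not match a newline by default, so a title broken across lines will make the search fail unless the re.DOTALL flag is passed. A quick check with a made-up string:

```python
import re

# made-up string with a newline inside the title
data = '<strong>Title:</strong>Some\nTitle<br><strong>'
pattern = r'<strong>Title:</strong>(.*?)<br><strong>'

# default: '.' does not cross the newline, so there is no match
print(re.search(pattern, data))  # None

# with re.DOTALL, '.' matches newlines too
m = re.search(pattern, data, re.DOTALL)
print(repr(m.group(1)))  # 'Some\nTitle'
```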
> --- Original Message ---
> From: Kent Johnson <kent37 at tds.net>
> To: Python Tutor <tutor at python.org>
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Thu, 29 Dec 2005 14:18:38 -0500
>
> Try something like this:
>
> def process(data):
>     # this is a function you define to process the data from one file
>
> maxFileIndex = ...  # whatever the max count is
> for i in range(1, maxFileIndex+1):  # i will take on each value
>                                     # from 1 to maxFileIndex
>     name = 'article%s.html' % i  # make a file name
>     f = open(name)               # open the file and read its contents
>     data = f.read()
>     f.close()
>     process(data)
>
> Kent
>
> PS Please reply to the list
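Putting the two pieces together, one way to fill in process() is with the regex from my first attempt. This is only a sketch, and the re.DOTALL flag is my addition so '.' also matches the stray newlines:

```python
import re

catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>', re.DOTALL)

def process(data):
    # extract the category from one file's HTML, if the pattern is present
    m = catRe.search(data)
    if m is None:
        return None  # this file doesn't contain the expected markup
    return m.group(1)

# made-up sample data
print(process('<strong>Title:</strong>Python<br><strong>Author:</strong>x'))  # Python
print(process('<p>nothing here</p>'))  # None
```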