[Tutor] How to write a loop in python to find HTML tags in a text file

Alan Gauld alan.gauld at yahoo.co.uk
Fri Mar 19 17:16:29 EDT 2021


On 17/03/2021 11:27, S Monzur wrote:
> Thank you for explaining the process. Might you advise me on how to use
> beautiful soup on this text file to a) separate the metadata from the
> bodytext

Nobody has come back with a BS solution, so I'll try to address
the original issue...

>> This works for a single article (provided the div never crosses
>> a line boundary which it is perfectly entitled to do).
>> But you cannot find the closing <div> without a huge amount
>> of effort since there could be other divs within the body.

You haven't answered this implied question. But assuming you
a) cannot use the full original html message (the easiest solution)

and

b) you don't care about the content or format of the message
   body (as implied by your partial solution)

then you can add a loop by simply doing the following (in pseudocode):

with (open input file) as inf:
    instring = inf.read()  # read the whole file
    messages = []
    while instring:
       msgStart = instring.find(STARTTAG)
       msgEnd = instring.find(ENDTAG, msgStart)
       if msgStart == -1 or msgEnd == -1: break  # no complete message left
       messages.append(instring[msgStart:msgEnd])
       instring = instring[msgEnd + len(ENDTAG):]  # skip past the end tag

You will need to convert that to Python...
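
One possible conversion is sketched below. It is only a sketch: the
function name and the tag strings in the usage lines are made up for
illustration, since the real markers depend on your file. Like the
pseudocode, it takes the first end tag after each start tag, so it
will cut a message short if the body contains nested divs.

```python
def split_messages(text, start_tag, end_tag):
    """Return each chunk running from start_tag up to the next end_tag."""
    messages = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:          # no more start tags
            break
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:            # start tag without a matching end tag
            break
        messages.append(text[start:end])
        pos = end + len(end_tag)  # continue searching after the end tag
    return messages


# Hypothetical tags and sample text, not taken from the real file.
sample = ('<h1>A</h1><div class="story">one</div>'
          '<h1>B</h1><div class="story">two</div>')
found = split_messages(sample, '<div class="story">', '</div>')
print(found)
```

To run it on a file, read the whole file into a string first (for
example with open(filename) and read()) and pass that string in as
text.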

>> A slightly easier approach if you have the option is to
>> keep the articles in separate files. 

This allows you to put your existing code in a function and call it in
a loop:

messages = [getMessageBody(os.path.join(MSGDIR, msgFile))
            for msgFile in os.listdir(MSGDIR)]
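
A slightly fuller sketch of the same idea follows. The function
get_message_body() here is only a stand-in for your existing code, and
the tag strings are assumptions, not the real markers in your files.
Note that os.listdir() returns bare file names, so each one has to be
joined to the directory before it can be opened.

```python
import os

START_TAG = '<div class="story">'  # hypothetical start marker
END_TAG = "</div>"                 # hypothetical end marker


def get_message_body(path):
    """Stand-in for the existing single-file code: text between the tags."""
    with open(path, encoding="utf-8") as inf:
        text = inf.read()
    start = text.find(START_TAG)
    if start == -1:
        return ""                  # no body marker in this file
    end = text.find(END_TAG, start)
    if end == -1:
        end = len(text)            # no end tag: take the rest of the file
    return text[start + len(START_TAG):end]


def get_all_bodies(msg_dir):
    """Apply get_message_body() to every regular file in msg_dir."""
    bodies = []
    for name in sorted(os.listdir(msg_dir)):
        path = os.path.join(msg_dir, name)  # listdir gives names, not paths
        if os.path.isfile(path):            # skip subdirectories
            bodies.append(get_message_body(path))
    return bodies
```

Sorting the names just makes the order of the results predictable;
os.listdir() itself returns them in arbitrary order.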

HTH
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



