[Tutor] How to write a loop in python to find HTML tags in a text file
Alan Gauld
alan.gauld at yahoo.co.uk
Fri Mar 19 17:16:29 EDT 2021
On 17/03/2021 11:27, S Monzur wrote:
> Thank you for explaining the process. Might you advise me on how to use
> beautiful soup on this text file to a) separate the metadata from the
> bodytext
Nobody has come back with a BS solution so I'll try to address
the original issue...
>> This works for a single article (provided the div never crosses
>> a line boundary which it is perfectly entitled to do).
>> But you cannot find the closing </div> without a huge amount
>> of effort since there could be other divs within the body.
You haven't answered this implied question. But assuming you
a) cannot use the full original HTML message (the easiest solution)
and
b) don't care about the content or format of the message
body (as implied by your partial solution)
Then you can add a loop by simply doing the following (in pseudo code):

    with (open input file) as inf:
        instring = inf.read()    # read the whole file
    messages = []
    while instring:
        msgStart = instring.find(STARTTAG)
        if msgStart == -1: break    # no more messages
        msgEnd = instring.find(ENDTAG, msgStart)
        if msgEnd == -1: break      # last message is unterminated
        messages.append(instring[msgStart:msgEnd])
        instring = instring[msgEnd:]
You will need to convert that to Python...
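One possible conversion, as a rough sketch only. The function name
split_messages and the tag strings in the usage note are my own
illustrative choices, not anything taken from your file, so substitute
whatever markers actually delimit your articles. It shares the
limitation mentioned above: it stops at the first end tag it sees, so
nested divs inside the body will cut the message short.

```python
def split_messages(text, start_tag, end_tag):
    """Collect every span running from start_tag up to (not including)
    the next end_tag. Both tag strings are assumptions supplied by the caller."""
    messages = []
    pos = 0
    while True:
        msg_start = text.find(start_tag, pos)
        if msg_start == -1:
            break                      # no more messages
        msg_end = text.find(end_tag, msg_start + len(start_tag))
        if msg_end == -1:
            break                      # last message is unterminated
        messages.append(text[msg_start:msg_end])
        pos = msg_end + len(end_tag)   # carry on after the end tag
    return messages
```

To use it, read the file as in the pseudo code and pass the string in,
for example split_messages(instring, '<div class="body">', '</div>').
Tracking a position instead of re-slicing the string avoids copying the
remainder of the file on every pass.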
>> A slightly easier approach if you have the option is to
>> keep the articles in separate files.
This allows you to put your existing code in a function and call it in
a loop:
messages = [getMessageBody(os.path.join(MSGDIR, msgFile))  # listdir gives bare names
            for msgFile in os.listdir(MSGDIR)]
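A fuller sketch of the one-file-per-article idea, under stated
assumptions: START_TAG and END_TAG are placeholder markers,
get_message_body stands in for your existing single-article code, and
get_all_bodies is a hypothetical helper name. None of these come from
your actual files.

```python
import os

START_TAG = '<div class="body">'   # assumed marker, adjust to your files
END_TAG = '</div>'                 # assumed marker, adjust to your files

def get_message_body(path):
    """Return the text from START_TAG up to the first END_TAG.
    Returns '' if START_TAG is absent, and everything from START_TAG
    to the end of the file if END_TAG is missing."""
    with open(path, encoding="utf-8") as inf:
        text = inf.read()
    start = text.find(START_TAG)
    if start == -1:
        return ""
    end = text.find(END_TAG, start + len(START_TAG))
    return text[start:] if end == -1 else text[start:end]

def get_all_bodies(msg_dir):
    """Apply get_message_body to every regular file in msg_dir, in name order."""
    bodies = []
    for name in sorted(os.listdir(msg_dir)):
        path = os.path.join(msg_dir, name)   # listdir returns bare names
        if os.path.isfile(path):             # skip sub-directories
            bodies.append(get_message_body(path))
    return bodies
```

Sorting the names just makes the order of the results predictable,
since os.listdir makes no promise about ordering.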
HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos