How to loop over a text file (to remove tags and normalize) using Python

Tue Mar 9 17:00:27 EST 2021

If you could use pastebin or a similar site to share your code, it would help
tremendously. As posted, the indentation has been lost and the code cannot be
read easily.

On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
>Dear List,
>
>Newbie here. I am trying to loop over a text file to remove HTML tags,
>punctuation marks, and stopwords. I have already used Beautiful Soup (Python v
>3.8.3) to scrape the text (newspaper articles) from the site. It returns a
>list that I saved as a file. However, I am not sure how to use a loop in
>order to process all the items in the text file.
>
>In the code below I have used listfilereduced.txt (containing data from one
>news article; link to listfilereduced.txt here
><https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>).
>However, I would like to run this code on listfile.text (containing data from
>multiple articles; link to listfile.text
><https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
>
>
>Any help would be greatly appreciated!
>
>P.S. The text is in a Non-English script, but the tags are all in English.
>
>
>#The code below is for a textfile containing just one item. I am not sure
>#how to tweak this to make it run for listfile.text (which contains raw data
>#from multiple articles)
>with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
>    rawData = my_file.read()
>print(rawData)
>
>#Separating body text from other data
>articleStart = rawData.find("<div class=\"story-element story-element-text\">")
>articleData = rawData[:articleStart]
>articleBody = rawData[articleStart:]
>print(articleData)
>print("*******")
>print(articleBody)
>print("*******")
>
>#First, I define a function to strip tags from the body text
>def stripTags(pageContents):
>    insideTag = 0
>    text = ''
>    for char in pageContents:
>        if char == '<':
>            insideTag = 1
>        elif (insideTag == 1 and char == '>'):
>            insideTag = 0
>        elif insideTag == 1:
>            continue
>        else:
>            text += char
>    return text
>
>#Calling the function
>articleBodyText = stripTags(articleBody)
>print(articleBodyText)
>
>##Isolating article title and publication date
>TitleEndLoc = articleData.find("</h1>")
>dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
>dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
>titleString = articleData[:TitleEndLoc]
>dateString = articleData[dateStartLoc:dateEndLoc]
>
>##Call stripTags to clean
>articleTitle = stripTags(titleString)
>articleDate = stripTags(dateString)
>print(articleTitle)
>print(articleDate)
>
>#Cleaning the date a bit more
>startLocDate = articleDate.find(":")
>endLocDate = articleDate.find(",")
>articleDateClean = articleDate[startLocDate+2:endLocDate]
>print(articleDateClean)
>
>#save all this data to a dictionary that saves the title, data and the body text
>PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
>print(PAloTextDict)
>
>#Normalize text by:
>#1. Splitting paragraphs of text into lists of words
>articleBodyWordList = articleBodyText.split()
>print(articleBodyWordList)
>
>#2.Removing punctuation and stopwords
>from bnlp.corpus import stopwords, punctuations
>
>#A. Remove punctuation first
>listNoPunct = []
>for word in articleBodyWordList:
>    for mark in punctuations:
>        word = word.replace(mark, '')
>    listNoPunct.append(word)
>print(listNoPunct)
>
>#B. removing stopwords
>banglastopwords = stopwords()
>print(banglastopwords)
>cleanList = []
>for word in listNoPunct:
>    if word in banglastopwords:
>        continue
>    else:
>        cleanList.append(word)
>print(cleanList)
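
As far as I can make out from the quoted code, the tag stripping and the punctuation/stopword cleanup could each be a small reusable function. Below is a rough sketch using only the standard library. The names TextExtractor, strip_tags and normalize are made up for illustration, and the punctuation and stopword values in the usage note are placeholders, not what bnlp actually provides. It also assumes every punctuation mark is a single character.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text outside a tag.
        self.parts.append(data)


def strip_tags(html):
    """Return html with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def normalize(text, punctuation, stopwords):
    """Split text into words, delete punctuation characters, drop stopwords.

    Assumption: each punctuation mark is a single character.
    """
    table = str.maketrans("", "", "".join(punctuation))
    stop = set(stopwords)  # set membership is faster than scanning a list
    words = (word.translate(table) for word in text.split())
    # Skip words that became empty once their punctuation was removed.
    return [word for word in words if word and word not in stop]
```

Usage would be something like normalize(strip_tags(articleBody), punctuations, stopwords()), with the last two coming from bnlp as in your code. I have not checked what types bnlp returns, so treat that call as an assumption.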
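
On the actual question (running the same steps over every article in the file), the usual pattern is to put the per-article work into a function and call it in a loop. Here is a minimal sketch. I have not looked at how listfile.text is laid out, so the splitting step is purely an assumption: it supposes every article begins with the same marker string, which you would need to pick by inspecting the file. The names split_articles, process_article and process_file are hypothetical. Only the body marker is taken from your code.

```python
# Marker copied from the quoted code: where the article body starts.
BODY_MARKER = '<div class="story-element story-element-text">'


def split_articles(raw_data, start_marker):
    """Split one long string into chunks that each begin with start_marker.

    Assumption: every article in the file starts with start_marker.
    """
    pieces = raw_data.split(start_marker)
    # pieces[0] is whatever precedes the first marker, so it is skipped.
    return [start_marker + piece for piece in pieces[1:]]


def process_article(raw_article):
    """Separate the header (title, date) from the body of one article."""
    start = raw_article.find(BODY_MARKER)
    if start == -1:
        # find() returns -1 when the marker is missing; treat body as empty.
        return {"header": raw_article, "body": ""}
    return {"header": raw_article[:start], "body": raw_article[start:]}


def process_file(path, start_marker):
    """Read the whole file and process each article in turn."""
    with open(path, "r", encoding="utf8") as my_file:
        raw_data = my_file.read()
    return [process_article(a) for a in split_articles(raw_data, start_marker)]
```

For example, process_file('listfile.text', '<h1') would return one dictionary per article if each article happens to start with an h1 tag. The tag stripping, date cleanup and stopword removal would then go inside process_article so they run once per article.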

-- 

Daniel Ciprus                              .:|:.:|:.
CONSULTING ENGINEER.CUSTOMER DELIVERY   Cisco Systems Inc.

dciprus at cisco.com

tel: +1 703 484 0205
mob: +1 540 223 7098
