How to loop over a text file (to remove tags and normalize) using Python
Dan Ciprus (dciprus)
dciprus at cisco.com
Tue Mar 9 17:32:08 EST 2021
No problem, the list just converts everything into plain text, which is GREAT! :-)
So without digging deeply into what you need to do: I am assuming that your
input contains HTML tags. Why don't you use a library like
https://pypi.org/project/beautifulsoup4/ instead of doing harakiri by parsing
the data by hand, without even using regex? Just a hint.
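
To make the hint concrete, here is a minimal sketch of letting a real HTML
parser drop the tags instead of scanning character by character. It uses the
standard library's html.parser as a stand-in so it runs without installing
anything; with Beautiful Soup the equivalent would be
BeautifulSoup(html, "html.parser").get_text(). The names TextExtractor,
strip_tags and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Made-up sample; the class name is the one used in the original code.
sample = '<div class="story-element story-element-text"><p>First &amp; second.</p></div>'
print(strip_tags(sample))
```

Unlike a hand-written loop, a parser also copes with entities and with ">"
characters inside attribute values.
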
On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:
> Thank you and apologies! I did not realize how jumbled it was at the
> receiver's end.
> The code is now at this site : [1]https://pastebin.com/wSi2xzBh
> I'm basically trying to do a few things with my code-
>
> 1. Extract 3 strings from the text- title, date and main text
>
> 2. Remove all tags afterwards
>
> 3. Save in a dictionary, with three keys- title, date and bodytext.
>
> 4. Remove punctuation and stopwords (I've used a user-defined function
> for that).
>
> I've been able to do all of these steps for the file [2]ListFileReduced,
> as shown in the code (although it's clunky).
>
> But, I would like to be able to do it for the other text file: [3]ListFile
> which has more articles. I used BeautifulSoup to scrape the data from the
> website, and then generated a list that I saved as a text file.
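
One possible way to run the same steps over the larger file is to split the
raw text into one chunk per article and apply the single-article logic to each
chunk in a loop. The sketch below rests on assumptions that need checking
against the real file: that every article starts with its <h1> title, and that
the body starts at the "story-element story-element-text" div used in the
original code. The function names and the sample data are hypothetical, and
the date would be extracted the same way as the title.

```python
from html.parser import HTMLParser

ARTICLE_MARKER = "<h1"  # assumption: each article begins with its <h1> title
BODY_MARKER = '<div class="story-element story-element-text">'  # from the original code


class _TextOnly(HTMLParser):
    """Keeps the text between tags and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def split_articles(raw):
    """Cut the raw text into one string per article."""
    pieces = raw.split(ARTICLE_MARKER)[1:]  # text before the first marker is dropped
    return [ARTICLE_MARKER + piece for piece in pieces]


def parse_article(article):
    """Build the per-article dictionary (date omitted in this sketch)."""
    title_end = article.find("</h1>")
    body_start = article.find(BODY_MARKER)
    title = strip_tags(article[:title_end]) if title_end != -1 else ""
    body = strip_tags(article[body_start:]) if body_start != -1 else ""
    return {"Title": title, "Text": body}


def load_articles(path):
    """Read the whole file and return a list of article dictionaries."""
    with open(path, "r", encoding="utf8") as my_file:
        raw = my_file.read()
    return [parse_article(article) for article in split_articles(raw)]


# Made-up data standing in for the contents of the list file.
raw_sample = (
    '<h1>Title one</h1><div class="story-element story-element-text"><p>Body one.</p></div>'
    '<h1>Title two</h1><div class="story-element story-element-text"><p>Body two.</p></div>'
)
articles = [parse_article(article) for article in split_articles(raw_sample)]
for article in articles:
    print(article)
```

With the real data this would be called as load_articles('listfile.txt'); the
punctuation and stopword steps can then be applied to each dictionary's "Text"
value inside the same loop.
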
>
> Best,
> Monzur
> On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus)
> <[4]dciprus at cisco.com> wrote:
>
> If you could utilize pastebin or a similar site to show your code, it
> would help tremendously since it's an unindented mess now and cannot be
> read easily.
>
> On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
> >Dear List,
> >
> >Newbie here. I am trying to loop over a text file to remove html tags,
> >punctuation marks, stopwords. I have already used Beautiful Soup (Python v
> >3.8.3) to scrape the text (newspaper articles) from the site. It returns a
> >list that I saved as a file. However, I am not sure how to use a loop in
> >order to process all the items in the text file.
> >
> >In the code below I have used listfilereduced.text (containing data from one
> >news article, link to listfilereduced.txt here
> ><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> >however I would like to run this code on listfile.text (containing data from
> >multiple articles, link to listfile.text
> ><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
> >).
> >
> >
> >Any help would be greatly appreciated!
> >
> >P.S. The text is in a Non-English script, but the tags are all in English.
> >
> >
> >#The code below is for a textfile containing just one item. I am not sure
> >#how to tweak this to make it run for listfile.text (which contains raw data
> >#from multiple articles)
> >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
> >    rawData = my_file.read()
> >print(rawData)
> >
> >#Separating body text from other data
> >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> >articleData = rawData[:articleStart]
> >articleBody = rawData[articleStart:]
> >print(articleData)
> >print("*******")
> >print(articleBody)
> >print("*******")
> >
> >#First, I define a function to strip tags from the body text
> >def stripTags(pageContents):
> >    insideTag = 0
> >    text = ''
> >    for char in pageContents:
> >        if char == '<':
> >            insideTag = 1
> >        elif (insideTag == 1 and char == '>'):
> >            insideTag = 0
> >        elif insideTag == 1:
> >            continue
> >        else:
> >            text += char
> >    return text
> >
> >#Calling the function
> >articleBodyText = stripTags(articleBody)
> >print(articleBodyText)
> >
> >##Isolating article title and publication date
> >TitleEndLoc = articleData.find("</h1>")
> >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> >dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> >titleString = articleData[:TitleEndLoc]
> >dateString = articleData[dateStartLoc:dateEndLoc]
> >
> >##Call stripTags to clean
> >articleTitle = stripTags(titleString)
> >articleDate = stripTags(dateString)
> >print(articleTitle)
> >print(articleDate)
> >
> >#Cleaning the date a bit more
> >startLocDate = articleDate.find(":")
> >endLocDate = articleDate.find(",")
> >articleDateClean = articleDate[startLocDate+2:endLocDate]
> >print(articleDateClean)
> >
> >#save all this data to a dictionary that saves the title, data and the body text
> >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
> >print(PAloTextDict)
> >
> >#Normalize text by:
> >#1. Splitting paragraphs of text into lists of words
> >articleBodyWordList = articleBodyText.split()
> >print(articleBodyWordList)
> >
> >#2.Removing punctuation and stopwords
> >from bnlp.corpus import stopwords, punctuations
> >
> >#A. Remove punctuation first
> >listNoPunct = []
> >for word in articleBodyWordList:
> >    for mark in punctuations:
> >        word = word.replace(mark, '')
> >    listNoPunct.append(word)
> >print(listNoPunct)
> >
> >#B. removing stopwords
> >banglastopwords = stopwords()
> >print(banglastopwords)
> >cleanList = []
> >for word in listNoPunct:
> >    if word in banglastopwords:
> >        continue
> >    else:
> >        cleanList.append(word)
> >print(cleanList)
> >--
> >[7]https://mail.python.org/mailman/listinfo/python-list
>
> --
>
> Daniel Ciprus .:|:.:|:.
> CONSULTING ENGINEER.CUSTOMER DELIVERY Cisco Systems Inc.
>
> [8]dciprus at cisco.com
>
> tel: +1 703 484 0205
> mob: +1 540 223 7098
>
>References
>
> Visible links
> 1. https://pastebin.com/wSi2xzBh
> 2. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> 3. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> 4. mailto:dciprus at cisco.com
> 5. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> 6. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> 7. https://mail.python.org/mailman/listinfo/python-list
> 8. mailto:dciprus at cisco.com