Good way to remove/catch wrong tags?
Hello,

Some columns in a DB contain badly formed HTML, to the point that BeautifulSoup (with the lxml parser?) fails:

=============
from bs4 import BeautifulSoup

# Some records start with 0A</crap> (a newline byte followed by a stray closing tag)
soup = BeautifulSoup("\n</strong>", 'lxml')
# AttributeError: 'NoneType' object has no attribute 'text'
print(soup.body.text)
=============

What would be a nice way to solve the problem? Is there a command to remove wrong tags altogether (e.g. strings that start with </strong>), or should I just catch the error?

Thank you.
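As a work-around, if there are only a handful of wrong records, catching the error and then fixing those records in the DB does the job:

=======
try:
    #file.write(soup.body.text)
    # soup.body is None for the broken records, so .text raises AttributeError
    text = soup.body.text
except AttributeError as error:
    # Log the error so the offending record can be fixed in the DB later
    file.write(str(error))
========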
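
If the broken records turn out to be more than a handful, two small helpers might avoid the exception entirely. This is only a sketch: the names strip_leading_closing_tags and body_text_or_default are made up for illustration and are not part of BeautifulSoup, and the regex only handles the case described above (closing tags at the very start of the string), not malformed HTML in general.

```python
import re

# Hypothetical helpers for illustration; not part of BeautifulSoup or lxml.

# Leading whitespace followed by one or more closing tags, e.g. "\n</strong>".
_LEADING_CLOSERS = re.compile(r"^\s*(?:</[A-Za-z][^>]*>\s*)+")


def strip_leading_closing_tags(html):
    """Remove whitespace and stray closing tags from the start of a string."""
    return _LEADING_CLOSERS.sub("", html)


def body_text_or_default(soup, default=""):
    """Return soup.body.text, or `default` when the parsed document has no body."""
    body = soup.body
    return body.text if body is not None else default
```

Assuming soup = BeautifulSoup(html, 'lxml') as in the original post, body_text_or_default(soup) would return an empty string for the "\n</strong>" record instead of raising. strip_leading_closing_tags(html) could be applied before parsing, or used to clean the column in the DB.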
participants (1)
-
codecomplete@free.fr