[New-bugs-announce] [issue39833] Bug in html parsing module triggered by malformed input

Evan report at bugs.python.org
Mon Mar 2 21:16:05 EST 2020


New submission from Evan <ep5880a at student.american.edu>:

Relevant standard library module: C:\Users\User\AppData\Local\Programs\Python\Python38\lib\_markupbase.py


The issue: after parsing over 900 SEC filings with beautifulsoup4, I get this UserWarning:

UserWarning: unknown status keyword 'ERF' in marked section
  warnings.warn(msg)

followed by this traceback:
....
File "C:\Users\XXXX\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\__init__.py", line 325, in __init__
    self._feed()
....

File "C:\Users\XXXX\AppData\Local\Programs\Python\Python38\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment

It's probably due to malformed input from one of the documents.
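The failure pattern can be shown in isolation. This is only a minimal sketch, not the library code: `report_error` and `classify` are hypothetical names, and it assumes the error hook warns and returns rather than raising, which is what the UserWarning output above suggests.

```python
import warnings


def report_error(msg):
    # Hypothetical stand-in for an error() hook that only warns and returns.
    warnings.warn(msg)


def classify(keyword):
    # 'match' is bound in the first two branches only.
    if keyword in {"cdata", "ignore"}:
        match = "standard"
    elif keyword in {"if", "endif"}:
        match = "ms-office"
    else:
        report_error("unknown status keyword %r in marked section" % keyword)
    # If the else branch returned normally, 'match' is still unbound here,
    # so this test raises UnboundLocalError.
    if not match:
        return -1
    return match
```

Calling `classify("ERF")` emits the warning and then raises UnboundLocalError, the same sequence as in the traceback.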

Starting at line 144 of _markupbase.py we have:

    # Internal -- parse a marked section
    # Override this to handle MS-word extension syntax <![if word]>content<![endif]>
    def parse_marked_section(self, i, report=1):
        rawdata= self.rawdata
        assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
        sectName, j = self._scan_name( i+3, i )
        if j < 0:
            return j
        if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
            # look for standard ]]> ending
            match= _markedsectionclose.search(rawdata, i+3)
        elif sectName in {"if", "else", "endif"}:
            # look for MS Office ]> ending
            match= _msmarkedsectionclose.search(rawdata, i+3)
        else:
            self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
        if not match:
            return -1
        if report:
            j = match.start(0)
            self.unknown_decl(rawdata[i+3: j])
        return match.end(0)

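In the fall-through else branch `match` is never assigned, so when self.error() returns instead of raising (here it only issued a warning), the `if not match` check that follows fails. `match` should be set to None in that else branch, right before `if not match`.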
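A rough sketch of the proposed fix, written as a standalone function so it runs without touching the standard library. The name `find_marked_section_end` and its `sect_name` parameter are hypothetical, and the two regular expressions are assumed to mirror the ones in _markupbase; they are redefined here only for illustration.

```python
import re
import warnings

# Assumed to mirror the patterns of the same names in _markupbase.
_markedsectionclose = re.compile(r']\s*]\s*>')
_msmarkedsectionclose = re.compile(r']\s*>')


def find_marked_section_end(rawdata, i, sect_name):
    """Return the index just past the marked section starting at i, or -1."""
    match = None  # bound up front, so the unknown-keyword path is safe
    if sect_name in {"temp", "cdata", "ignore", "include", "rcdata"}:
        # look for standard ]]> ending
        match = _markedsectionclose.search(rawdata, i + 3)
    elif sect_name in {"if", "else", "endif"}:
        # look for MS Office ]> ending
        match = _msmarkedsectionclose.search(rawdata, i + 3)
    else:
        # Warn instead of raising, as the error hook did in the report.
        warnings.warn("unknown status keyword %r in marked section" % sect_name)
    if not match:
        return -1
    return match.end(0)
```

With `match` bound to None before the branches, an unknown keyword such as 'ERF' produces the warning and a -1 return value instead of an UnboundLocalError. Assigning None inside the else branch, as suggested above, would have the same effect.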

----------
components: Library (Lib)
messages: 363234
nosy: SanJacintoJoe
priority: normal
severity: normal
status: open
title: Bug in html parsing module triggered by malformed input
type: compile error
versions: Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue39833>
_______________________________________

