[ python-Bugs-1452246 ] htmllib doesn't properly substitute entities

SourceForge.net noreply at sourceforge.net
Sat Apr 1 03:13:43 CEST 2006


Bugs item #1452246, was opened at 2006-03-17 03:57
Message generated for change (Comment added) made by rvernica
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1452246&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Helmut Grohne (gnarfk)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmllib doesn't properly substitute entities

Initial Comment:
I'd like to illustrate the problem and suggest a fix with a
simple Python file (named htmllib2.py, so you can uncomment
the line in the doctest to see that my fix works). It's more
of a hack than a proper fix, though:
#!/usr/bin/env python2.4

"""
Use this instead of htmllib to have entitydefs
substituted in attributes, too.

Example:
>>> import htmllib
# >>> import htmllib2 as htmllib
>>> import formatter
>>> import StringIO
>>> s = StringIO.StringIO()
>>> p = htmllib.HTMLParser(
...     formatter.AbstractFormatter(formatter.DumbWriter(s)))
>>> p.feed('<img alt="&lt;&gt;&amp;">')
>>> s.getvalue()
'<>&'
"""

__all__ = ("HTMLParser",)

import htmllib
from htmlentitydefs import name2codepoint as entitytable

# Keep only entities whose code point fits in a single byte (Latin-1).
entitytable = dict([(k, chr(v)) for k, v in entitytable.items()
                    if v < 256])

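def entitysub(s):
    """Replace named entities (e.g. &lt;) in s with their characters."""
    ret = ""
    state = ""  # pending text since an unmatched '&'
    for c in s:
        if state.startswith('&'):
            if c == ';':
                # Unknown names are passed through unchanged.
                ret += entitytable.get(state[1:], '%s;' % state)
                state = ""
            else:
                state += c
        elif c == '&':
            state = c
        else:
            ret += c
    # Keep an unterminated '&...' run instead of silently dropping it.
    return ret + state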

class HTMLParser(htmllib.HTMLParser):
    def handle_starttag(self, tag, method, attrs):
        """Repair attribute values."""
        attrs = [(k, entitysub(v)) for (k, v) in attrs]
        method(attrs)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
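
For readers on Python 3, where htmllib and htmlentitydefs no longer
exist, the same substitution idea can be sketched on its own. This is
only an illustration, not the fix that went into the standard library:
the names ENTITY_TABLE and pending are made up here, html.entities is
assumed as the Python 3 home of name2codepoint, and the table is limited
to code points below 256 as in the file above. The sketch keeps an
unterminated '&' run (as in 'a & b') and passes unknown names through.

```python
# Python 3 sketch; html.entities replaces the Python 2 htmlentitydefs module.
from html.entities import name2codepoint

# Hypothetical name: same Latin-1 restriction (code point < 256) as above.
ENTITY_TABLE = {name: chr(cp) for name, cp in name2codepoint.items() if cp < 256}


def entitysub(s):
    """Replace named entities such as &lt; in s; leave unknown ones untouched."""
    out = []
    pending = ""  # text collected since the last unmatched '&'
    for c in s:
        if pending:
            if c == ";":
                # Look up the name without its leading '&'.
                out.append(ENTITY_TABLE.get(pending[1:], pending + ";"))
                pending = ""
            elif c == "&":
                # A new '&' starts before the old one closed: emit the old text.
                out.append(pending)
                pending = "&"
            else:
                pending += c
        elif c == "&":
            pending = "&"
        else:
            out.append(c)
    # Flush an unterminated '&...' run instead of dropping it.
    out.append(pending)
    return "".join(out)


if __name__ == "__main__":
    print(entitysub("&lt;&gt;&amp;"))
```

Numeric character references such as &#60; are deliberately left alone
in this sketch, since the file above does not handle them either.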


----------------------------------------------------------------------

Comment By: Rares Vernica (rvernica)
Date: 2006-03-31 17:13

Message:
Logged In: YES 
user_id=1491427

This bug has been fixed in patch #1462498.

Ray

----------------------------------------------------------------------
