getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup

Carl Banks pavlovevidence at gmail.com
Fri Mar 14 02:26:28 EDT 2008


On Mar 14, 1:53 am, John Nagle <na... at animats.com> wrote:
> John Machin wrote:
> > On Mar 14, 5:38 am, John Nagle <na... at animats.com> wrote:
> >>    Just noticed, again, that getattr/setattr are ASCII-only, and don't support
> >> Unicode.
>
> >>    SGMLlib blows up because of this when faced with a Unicode end tag:
>
> >>         File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
> >>         method = getattr(self, 'end_' + tag)
> >>         UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
> >>         in position 46: ordinal not in range(128)
>
> >> Should attributes be restricted to ASCII, or is this a bug?
>
> >>                                         John Nagle
>
> > Identifiers are restricted -- see section 2.3 (Identifiers and
> > keywords) of the Reference Manual. The restriction is in effect that
> > they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
> > obj.nonASCIIname in your code, it makes sense for the equivalent usage
> > in setattr and getattr not to be available.
>
> > However other than forcing unicode to str, setattr and getattr seem
> > not to care what you use:
>
>     OK. It's really a bug in SGMLlib, then.  SGMLlib lets you provide a
> subclass with a function with a name such as "end_img", to be called
> at the end of an "img" tag.  The mechanism which implements this blows
> up on any tag name that won't convert to "str", even when there are
> no "end_" functions that could be relevant.
>
>     It's easy to fix in SGMLlib.  It's just necessary to change
>
>         except AttributeError:
> to
>         except (AttributeError, UnicodeEncodeError):
>
> in four places.  I suppose I'll have to submit a patch.


FWIW, the stated goal of sgmllib is to parse the subset of SGML that
HTML uses.  HTML defines no elements with non-ASCII names, so I'm not
certain this would be considered a bug in sgmllib.
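
If the change does go in, the two exception types have to be grouped
in a tuple: in Python 2, "except A, B:" catches only A and binds the
caught instance to the name B.  A minimal sketch of the lookup,
assuming a made-up parser class (DemoParser and find_end_handler are
illustrative names of my own, not part of sgmllib):

```python
# Hypothetical stand-in for an sgmllib-style parser subclass.
class DemoParser(object):
    def end_img(self):
        return "closed img"


def find_end_handler(parser, tag):
    """Return the parser's end_<tag> method, or None if there is none."""
    try:
        return getattr(parser, 'end_' + tag)
    # The parentheses matter: they make this a tuple of exception types.
    # On Python 2 a non-ASCII unicode name raises UnicodeEncodeError here;
    # on Python 3 the same lookup simply raises AttributeError.
    except (AttributeError, UnicodeEncodeError):
        return None


parser = DemoParser()
img_handler = find_end_handler(parser, 'img')        # bound method end_img
odd_handler = find_end_handler(parser, u'caf\xae')   # no handler: None
```

Either way the caller gets None for a tag it has no handler for,
rather than a traceback.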


Carl Banks


