getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup
John Nagle
nagle at animats.com
Fri Mar 14 01:53:49 EDT 2008
John Machin wrote:
> On Mar 14, 5:38 am, John Nagle <na... at animats.com> wrote:
>> Just noticed, again, that getattr/setattr are ASCII-only, and don't support
>> Unicode.
>>
>> SGMLlib blows up because of this when faced with a Unicode end tag:
>>
>> File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
>> method = getattr(self, 'end_' + tag)
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
>> in position 46: ordinal not in range(128)
>>
>> Should attributes be restricted to ASCII, or is this a bug?
>>
>> John Nagle
>
> Identifiers are restricted -- see section 2.3 (Identifiers and
> keywords) of the Reference Manual. The restriction is in effect that
> they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
> obj.nonASCIIname in your code, it makes sense for the equivalent usage
> in setattr and getattr not to be available.
>
> However other than forcing unicode to str, setattr and getattr seem
> not to care what you use:
OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
subclass with a function with a name such as "end_img", to be called
at the end of an "img" tag. The mechanism which implements this blows
up on any tag name that won't convert to "str", even when there are
no "end_" functions that could be relevant.
It's easy to fix in SGMLlib. It's just necessary to change
    except AttributeError:
to
    except (AttributeError, UnicodeEncodeError):
in four places. I suppose I'll have to submit a patch.
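A minimal sketch of the idea, not the actual sgmllib source: the class name TolerantParser and the "seen" list are made up for illustration, and finish_endtag here only imitates the lookup shown in the traceback. Note that the two exception types have to be written as a parenthesized tuple. In Python 2, the unparenthesized form "except AttributeError, UnicodeEncodeError:" catches only AttributeError and binds the exception instance to the name UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the end-tag dispatch in sgmllib (assumed shape,
# not copied from the library).


class TolerantParser(object):
    def __init__(self):
        # Records which handlers ran, so the behavior can be inspected.
        self.seen = []

    def end_img(self):
        # Handler a subclass would supply for the closing "img" tag.
        self.seen.append('end_img')

    def unknown_endtag(self, tag):
        # Fallback for tags with no matching end_ handler.
        self.seen.append(('unknown', tag))

    def finish_endtag(self, tag):
        try:
            method = getattr(self, 'end_' + tag)
        except (AttributeError, UnicodeEncodeError):
            # Python 2 raises UnicodeEncodeError when the attribute name
            # will not convert to str; a plain missing handler raises
            # AttributeError. Either way there is no handler to call.
            self.unknown_endtag(tag)
            return
        method()


parser = TolerantParser()
parser.finish_endtag(u'img')    # dispatches to end_img
parser.finish_endtag(u'\xae')   # no handler; falls back to unknown_endtag
```

With this change a tag name that won't convert to "str" is treated like any other tag without a handler, instead of killing the parse.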
John Nagle
SiteTruth