BeautifulSoup vs. loose & chars

Duncan Booth duncan.booth at invalid.invalid
Tue Dec 26 12:05:11 EST 2006


"Felipe Almeida Lessa" <felipe.lessa at gmail.com> wrote:

> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
>> So do you want to remove "&" or replace them with "&amp;"? If you
>> want to replace it try the following;
> 
> I think he wants to replace them, but just the invalid ones. I.e.,
> 
> This &amp; this & that
> 
> would become
> 
> This &amp; this &amp; that
> 
> 
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.
> 

Since he's asking for valid xml as output, it isn't sufficient just to
ignore entity definitions: HTML has a lot of named entities such as
&nbsp; but xml only has a very limited set of predefined named entities
(&amp;, &lt;, &gt;, &quot; and &apos;). The safest technique is to convert
them all to numeric escapes except for that limited set, which is
guaranteed to be available in xml.
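
To illustrate the point about undefined entities, here is a small sketch. It is written for Python 3, and the helper name parses_as_xml is made up for the example. It checks which fragments the standard library's XML parser accepts: a named HTML entity or a loose "&" should be rejected, while a numeric escape or a predefined entity should pass.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(fragment):
    """Return True if the parser accepts the fragment as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml("<p>a&nbsp;b</p>"))   # HTML-only named entity
print(parses_as_xml("<p>a&#160;b</p>"))   # numeric escape for the same character
print(parses_as_xml("<p>a &amp; b</p>"))  # predefined XML entity
print(parses_as_xml("<p>a & b</p>"))      # loose ampersand
```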

Try this:

from cgi import escape
import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module's own table is left untouched.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: named entity
EntityPattern = re.compile(
    r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                return unichr(name2codepoint[match.group(3)])
    return EntityPattern.sub(unescape, s)

>>> escape(
...     decodeEntities("This &amp; this & that &eacute;")).encode(
...         'ascii', 'xmlcharrefreplace')
'This &amp; this &amp; that &#233;'
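
The code above relies on cgi.escape, htmlentitydefs and unichr, none of which exist in Python 3. A rough Python 3 equivalent might look like the sketch below. The names decode_entities and to_xml_safe are invented for illustration, and it departs from the code above in two ways: entity names may contain digits (so that names like frac12 match), and unknown names are left alone instead of raising KeyError.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy the stdlib table and add 'apos', which XML defines but HTML 4 does not.
ENTITIES = dict(name2codepoint)
ENTITIES["apos"] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: named entity
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));")

def decode_entities(s):
    """Replace character references and known named entities with characters."""
    def unescape(match):
        dec, hexa, name = match.groups()
        if dec:
            return chr(int(dec, 10))
        if hexa:
            return chr(int(hexa, 16))
        if name in ENTITIES:
            return chr(ENTITIES[name])
        return match.group(0)  # unknown name: leave the text as it was
    return ENTITY_PATTERN.sub(unescape, s)

def to_xml_safe(s):
    """Decode entities, re-escape &, < and >, and make the result pure ASCII."""
    escaped = escape(decode_entities(s), quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_xml_safe("This &amp; this & that &eacute;"))
```

Passing quote=False keeps html.escape in line with the default behaviour of the old cgi.escape, which only touches &, < and >.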


P.S. apos is handled specially as it isn't technically a
valid html entity (and Python doesn't include it in its entity
list), but it is an xml entity and recognised by many browsers so some
people might use it in html.
 


