BeautifulSoup vs. loose & chars
John Nagle
nagle at animats.com
Tue Dec 26 13:26:54 EST 2006
Felipe Almeida Lessa wrote:
> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
>
>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>> to replace it try the following;
>
>
> I think he wants to replace them, but just the invalid ones. I.e.,
>
> This & this &amp; that
>
> would become
>
> This &amp; this &amp; that
>
>
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.
Yes, and the appropriate one is:
import re
krefindamp = re.compile(r'&(?!(\w|#)+;)')
...
xmlsection = krefindamp.sub('&amp;', xmlsection)
This will replace an '&' with '&amp;' if the '&' isn't
immediately followed by some combination of letters, digits,
underscores, and '#' ending with a ';'. Admittedly, this would let
something like '&xx#2;', which isn't a legal entity, through unmodified.
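To make the behavior concrete, here is a small sketch that wraps the same pattern in a helper and runs it on a few strings. The function name escape_loose_amps and the sample inputs are only for illustration; they are not part of the original code.

```python
import re

# Same pattern as above: an '&' that is NOT followed by a run of
# word characters and/or '#' terminated by ';'.
krefindamp = re.compile(r'&(?!(\w|#)+;)')

def escape_loose_amps(text):
    """Replace bare '&' with '&amp;', leaving entity-like references alone.

    Hypothetical helper name, used here only for the demonstration.
    """
    return krefindamp.sub('&amp;', text)

# The bare '&' is escaped; the existing '&amp;' is left as it is.
print(escape_loose_amps('This & this &amp; that'))

# '&T' has no terminating ';', so it is escaped. '&#38;' looks like an
# entity and is kept. '&xx#2;' is not legal but still slips through.
print(escape_loose_amps('AT&T &#38; &xx#2;'))
```

Because the substitution skips anything that already looks like an entity, running it a second time on its own output should change nothing.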
There's still a potential problem with unknown entities in the output XML, but
at least they're recognized as entities.
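If the two loose ends above matter, one possible tightening looks like the sketch below. Everything in it is my assumption rather than tested production code: the stricter pattern only accepts a name made of ASCII letters and digits, a decimal reference, or a hex reference (real XML names allow more characters), and the second helper simply lists named entities outside the five that XML predefines, so they can be dealt with separately.

```python
import re

# Hypothetical stricter pattern: '&' is left alone only when followed by
#   name;     (ASCII letter, then letters/digits; a simplification)
#   #123;     (decimal character reference)
#   #x1F;     (hexadecimal character reference)
STRICT_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

# Matches named references so they can be checked against a known set.
NAMED_REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

# The five entities predefined by XML itself.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

def escape_loose_amps_strict(text):
    """Escape every '&' that does not start a well-formed reference."""
    return STRICT_AMP.sub('&amp;', text)

def unknown_named_entities(text):
    """Return the sorted names of entities an XML parser would not know."""
    names = set(NAMED_REF.findall(text))
    return sorted(names - XML_PREDEFINED)

cleaned = escape_loose_amps_strict('a &xx#2; b &#x26; c &nbsp; d & e')
print(cleaned)
print(unknown_named_entities(cleaned))
```

With this variant '&xx#2;' gets its ampersand escaped, while something like '&nbsp;' is still passed through and reported by the second helper, since plain XML does not define it.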
John Nagle