BeautifulSoup vs. loose & chars
Frederic Rentsch
anthra.norell at vtxmail.ch
Tue Dec 26 18:30:51 EST 2006
John Nagle wrote:
> Felipe Almeida Lessa wrote:
>
>> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
>>
>>
>>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>>> to replace it try the following;
>>>
>> I think he wants to replace them, but just the invalid ones. I.e.,
>>
>> This & this &amp; that
>>
>> would become
>>
>> This &amp; this &amp; that
>>
>>
>> No, i don't know how to do this efficiently. =/...
>> I think some kind of regex could do it.
>>
>
> Yes, and the appropriate one is:
>
> krefindamp = re.compile(r'&(?!(\w|#)+;)')
> ...
> xmlsection = re.sub(krefindamp,'&amp;',xmlsection)
>
> This will replace an '&' with '&amp;' if the '&' isn't
> immediately followed by some combination of letters, numbers,
> and '#' ending with a ';'. Admittedly this would let something
> like '&xx#2;', which isn't a legal entity, through unmodified.
>
> There's still a potential problem with unknown entities in the output XML, but
> at least they're recognized as entities.
>
> John Nagle
>
>
>
Here's another idea:
>>> s = '''<html> htm tag should not translate
> & should be &amp;
> &xx#2; isn't a legal entity and should translate
> &#123; is a legal entity and should not translate
</html>'''
>>> import SE # http://cheeseshop.python.org/pypi/SE/2.3
>>> HTM_Escapes = SE.SE (definitions) # See definitions below the dotted line
>>> print HTM_Escapes (s)
<html> htm tag should not translate
&gt; &amp; should be &amp;
&gt; &amp;xx#2; isn&quot;t a legal entity and should translate
&gt; &#123; is a legal entity and should not translate
</html>
Regards
Frederic
------------------------------------------------------------------------------
definitions = '''
# Do # Don't do
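# " =&nbsp;"   &nbsp;==    #  32 20
(34)=&dquot;   &dquot;==   #  34 22
&=&amp;        &amp;==     #  38 26
'=&quot;       &quot;==    #  39 27
<=&lt;         &lt;==      #  60 3c
>=&gt;         &gt;==      #  62 3e
©=&copy;       &copy;==    # 169 a9
·=&middot;     &middot;==  # 183 b7
»=&raquo;      &raquo;==   # 187 bb
À=&Agrave;     &Agrave;==  # 192 c0
Á=&Aacute;     &Aacute;==  # 193 c1
Â=&Acirc;      &Acirc;==   # 194 c2
Ã=&Atilde;     &Atilde;==  # 195 c3
Ä=&Auml;       &Auml;==    # 196 c4
Å=&Aring;      &Aring;==   # 197 c5
Æ=&AElig;      &AElig;==   # 198 c6
Ç=&Ccedil;     &Ccedil;==  # 199 c7
È=&Egrave;     &Egrave;==  # 200 c8
É=&Eacute;     &Eacute;==  # 201 c9
Ê=&Ecirc;      &Ecirc;==   # 202 ca
Ë=&Euml;       &Euml;==    # 203 cb
Ì=&Igrave;     &Igrave;==  # 204 cc
Í=&Iacute;     &Iacute;==  # 205 cd
Î=&Icirc;      &Icirc;==   # 206 ce
Ï=&Iuml;       &Iuml;==    # 207 cf
Ð=&ETH;        &ETH;==     # 208 d0
Ñ=&Ntilde;     &Ntilde;==  # 209 d1
Ò=&Ograve;     &Ograve;==  # 210 d2
Ó=&Oacute;     &Oacute;==  # 211 d3
Ô=&Ocirc;      &Ocirc;==   # 212 d4
Õ=&Otilde;     &Otilde;==  # 213 d5
Ö=&Ouml;       &Ouml;==    # 214 d6
Ø=&Oslash;     &Oslash;==  # 216 d8
Ù=&Ugrave;     &Ugrave;==  # 217 d9
Ú=&Uacute;     &Uacute;==  # 218 da
Û=&Ucirc;      &Ucirc;==   # 219 db
Ü=&Uuml;       &Uuml;==    # 220 dc
Ý=&Yacute;     &Yacute;==  # 221 dd
Þ=&THORN;      &THORN;==   # 222 de
ß=&szlig;      &szlig;==   # 223 df
à=&agrave;     &agrave;==  # 224 e0
á=&aacute;     &aacute;==  # 225 e1
â=&acirc;      &acirc;==   # 226 e2
ã=&atilde;     &atilde;==  # 227 e3
ä=&auml;       &auml;==    # 228 e4
å=&aring;      &aring;==   # 229 e5
æ=&aelig;      &aelig;==   # 230 e6
ç=&ccedil;     &ccedil;==  # 231 e7
è=&egrave;     &egrave;==  # 232 e8
é=&eacute;     &eacute;==  # 233 e9
ê=&ecirc;      &ecirc;==   # 234 ea
ë=&euml;       &euml;==    # 235 eb
ì=&igrave;     &igrave;==  # 236 ec
í=&iacute;     &iacute;==  # 237 ed
î=&icirc;      &icirc;==   # 238 ee
ï=&iuml;       &iuml;==    # 239 ef
ð=&eth;        &eth;==     # 240 f0
ñ=&ntilde;     &ntilde;==  # 241 f1
ò=&ograve;     &ograve;==  # 242 f2
ó=&oacute;     &oacute;==  # 243 f3
ô=&ocirc;      &ocirc;==   # 244 f4
õ=&otilde;     &otilde;==  # 245 f5
ö=&ouml;       &ouml;==    # 246 f6
ø=&oslash;     &oslash;==  # 248 f8
ù=&ugrave;     &ugrave;==  # 249 f9
ú=&uacute;     &uacute;==  # 250 fa
û=&ucirc;      &ucirc;==   # 251 fb
ü=&uuml;       &uuml;==    # 252 fc
ý=&yacute;     &yacute;==  # 253 fd
þ=&thorn;      &thorn;==   # 254 fe
(xff)=&yuml;               # 255 ff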
&#== # All numeric codes
"~<(.|\n)*?>~==" # All HTM tags '''
If the ampersand is all you need to handle, you can erase the other
entries in the first column. You need to keep the second column, though,
except for its last entry (the tag pattern): the tags no longer need
protection once '<' and '>' are gone from the first column.
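For comparison, the ampersand-only case can also be sketched with nothing but the standard re module. This is not part of SE. The pattern and the name escape_bare_ampersands are assumptions made for illustration, and the sketch only checks that a reference is well formed (&name;, &#123; or &#x7B;), not that the name is a known entity. It does reject the '&xx#2;' case that the simpler pattern quoted above lets through.

```python
import re

# Assumption: a well-formed reference is &name;, &#123; or &#x7B;.
# Any '&' not followed by one of those forms is treated as a loose ampersand.
_BARE_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

def escape_bare_ampersands(text):
    """Replace each '&' that does not start a well-formed reference with '&amp;'."""
    return _BARE_AMP.sub('&amp;', text)

sample = "This & this &amp; that, &xx#2; and &#123;"
print(escape_bare_ampersands(sample))
# prints: This &amp; this &amp; that, &amp;xx#2; and &#123;
```

Unknown but well-formed names such as '&foo;' still pass through untouched, so the caveat about unknown entities in the output remains.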
Definitions are easily edited and can be kept in text files.
The SE constructor accepts a file name instead of a definitions string:
>>> HTM_Escapes = SE.SE ('definition_file_name')
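The two-column idea (translate some characters, leave existing entities and tags alone) can be approximated without SE. The following is only a rough Python 3 sketch under stated assumptions: it uses the standard html.entities.codepoint2name table in place of the hand-written definitions, the helper names are hypothetical, and it does not reproduce SE's actual behaviour. In particular, the apostrophe is left untouched because that table has no entry for it.

```python
import re
from html.entities import codepoint2name

# Spans copied through unchanged (the "don't do" column):
# tags, numeric references, and well-formed named references.
_PROTECTED = re.compile(
    r'<(?:.|\n)*?>'
    r'|&#[0-9]+;'
    r'|&#[xX][0-9A-Fa-f]+;'
    r'|&[A-Za-z][A-Za-z0-9]*;'
)

def _escape_char(ch):
    # The "do" column: map a character to its named entity if one exists.
    name = codepoint2name.get(ord(ch))
    return '&%s;' % name if name else ch

def _escape_chunk(chunk):
    return ''.join(_escape_char(ch) for ch in chunk)

def escape_keeping_markup(text):
    """Escape characters outside protected spans; copy protected spans as-is."""
    out = []
    pos = 0
    for match in _PROTECTED.finditer(text):
        out.append(_escape_chunk(text[pos:match.start()]))
        out.append(match.group(0))
        pos = match.end()
    out.append(_escape_chunk(text[pos:]))
    return ''.join(out)

s = "<html> tag\n> & should be &amp;\n> &xx#2; isn't legal\n> &#123; is legal\n</html>"
print(escape_keeping_markup(s))
```

As with the tag pattern in the definitions, a stray '<' followed later by a '>' is taken for a tag and passed through, so this only suits input whose angle brackets mostly belong to real markup.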