unescape HTML entities
Frederic Rentsch
anthra.norell at vtxmail.ch
Sun Oct 29 07:29:32 EST 2006
Rares Vernica wrote:
> Hi,
>
> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
> "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only is this
> dictionary the opposite of what I need, it also does not have "&nbsp;".
>
> It has to be in python 2.4.
>
> Thanks a lot,
> Ray
>
One way is this:
>>> import SE   # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')   # HTM2ISO.se is included
'output_file_name'
For repeated translations the SE object would be assigned to a variable:
>>> HTM_Decoder = SE.SE ('HTM2ISO.se')
SE objects take and return strings as well as file names, which is useful
for translating string variables, doing line-by-line translations, and
for interactive development or verification. A simple way to check a
substitution set is to use its definitions as test data. The following
is a section of the definition file HTM2ISO.se:
>>> test_string = '''
&oslash;=(xf8) # 248 f8
&ugrave;=(xf9) # 249 f9
&uacute;=(xfa) # 250 fa
&ucirc;=(xfb) # 251 fb
&uuml;=(xfc) # 252 fc
&yacute;=(xfd) # 253 fd
&thorn;=(xfe) # 254 fe
&eacute;=(xe9)
&ecirc;=(xea)
&euml;=(xeb)
&igrave;=(xec)
&iacute;=(xed)
&icirc;=(xee)
&iuml;=(xef)
'''
>>> print HTM_Decoder (test_string)
ø=(xf8) # 248 f8
ù=(xf9) # 249 f9
ú=(xfa) # 250 fa
û=(xfb) # 251 fb
ü=(xfc) # 252 fc
ý=(xfd) # 253 fd
þ=(xfe) # 254 fe
é=(xe9)
ê=(xea)
ë=(xeb)
ì=(xec)
í=(xed)
î=(xee)
ï=(xef)
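The same kind of self-check can be sketched with the standard library alone. This is only an illustration, not how SE reads its files: it assumes definition lines of the form `&oslash;=(xf8)` (entity name, then the target byte in hex), and `check_definitions` is a hypothetical helper that compares each line against `name2codepoint`.

```python
import re

try:
    from htmlentitydefs import name2codepoint      # Python 2
except ImportError:
    from html.entities import name2codepoint       # Python 3

# Sample lines in the assumed '&name;=(xNN)' form; the trailing comment is optional.
definitions = """
&oslash;=(xf8) # 248 f8
&thorn;=(xfe) # 254 fe
&eacute;=(xe9)
&iuml;=(xef)
"""

# Captures the entity name and the two hex digits of each definition.
LINE = re.compile(r'&([A-Za-z0-9]+);=\(x([0-9a-fA-F]{2})\)')

def check_definitions(text):
    """Return the (name, hex) pairs that disagree with name2codepoint."""
    mismatches = []
    for name, hex_code in LINE.findall(text):
        if name2codepoint.get(name) != int(hex_code, 16):
            mismatches.append((name, hex_code))
    return mismatches

mismatches = check_definitions(definitions)
```

An empty `mismatches` list means every sampled definition agrees with the standard entity table.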
Another feature of SE is modularity.
>>> strip_tags = '''
~<(.|\x0a)*?>~=(9) # one tag to one tab
~<!--(.|\x0a)*?-->~=(9) # one comment to one tab
| # run
"~\x0a[ \x09\x0d\x0a]*~=(x0a)" # delete empty lines
~\t+~=(32) # one or more tabs to one space
~\x20\t+~=(32) # one space and one or more tabs to one space
~\t+\x20~=(32) # one or more tabs and one space to one space
'''
>>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it
together with HTM2ISO.se:
>>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')   # Order doesn't matter
Or, if you have two SE objects, one for stripping tags and one for
decoding the ampersands, you can nest them like this:
>>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
>>> print Tag_Stripper (HTM_Decoder (test_string))
René est un garçon qui paraît plus âgé.
Nesting works with file names too, because file names are returned:
>>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
'output_file_name'
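Nesting works because each object maps a string to a string. A minimal sketch of the same idea with plain functions follows. `decode_entities` and `strip_tags` are hypothetical stand-ins written with `re`, not the SE objects; the entity table covers only the names in the test string, and the tag pattern is a simplification of the one in the definitions above.

```python
import re

# Hypothetical stand-ins for the two SE objects; each maps a string to a string.
ENTITIES = {'eacute': u'\xe9', 'ccedil': u'\xe7', 'icirc': u'\xee', 'acirc': u'\xe2'}

def decode_entities(text):
    # Replace the named entities listed above; leave anything else untouched.
    return re.sub(r'&(\w+);', lambda m: ENTITIES.get(m.group(1), m.group(0)), text)

def strip_tags(text):
    text = re.sub(r'<[^>]*>', ' ', text)       # each tag becomes one space
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

test_string = (u"<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> "
               u"est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>")

# Same nesting order as in the session above: decode first, then strip.
result = strip_tags(decode_entities(test_string))
```

Decoding before stripping follows the order used in the session; if the input may contain `&lt;` or `&gt;`, stripping first would avoid treating decoded angle brackets as tags.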
Frederic
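
For comparison, the original question can probably also be handled without SE. The sketch below is not part of the reply above; it relies on `htmlentitydefs.name2codepoint` (in the standard library since Python 2.3, and it does contain `nbsp`) and on `re`. The function name `unescape_entities` is made up for illustration, and the code assumes unicode input when run under Python 2.

```python
import re

try:                                    # Python 2 (the thread targets 2.4)
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:                     # Python 3
    from html.entities import name2codepoint
    to_char = chr

# Decimal reference, hexadecimal reference, or entity name.
ENTITY = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

def unescape_entities(text):
    """Replace named and numeric character references; keep unknown names."""
    def replace(match):
        ref = match.group(1)
        if ref[:2] in ('#x', '#X'):
            return to_char(int(ref[2:], 16))    # hexadecimal, e.g. &#xE9;
        if ref.startswith('#'):
            return to_char(int(ref[1:]))        # decimal, e.g. &#233;
        if ref in name2codepoint:
            return to_char(name2codepoint[ref]) # named, e.g. &nbsp;
        return match.group(0)                   # unknown name: leave as written
    return ENTITY.sub(replace, text)
```

Because the substitution runs in a single pass, `&amp;lt;` becomes `&lt;` rather than `<`.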