unescape HTML entities
Rares Vernica
rvernica at gmail.com
Wed Nov 1 19:53:01 EST 2006
Hi,
I downloaded 2.2 beta, just to be sure I have the same version as you
specified. (The file names are no longer funny.) Anyway, it does not seem
to do what you said:
In [14]: import SE
In [15]: SE.version
-------> SE.version()
Out[15]: 'SE 2.2 beta - SEL 2.2 beta'
In [16]: HTM_Decoder = SE.SE ('HTM2ISO.se')
In [17]: test_string = '''
....: &oslash;=(xf8) # 248 f8
....: &ugrave;=(xf9) # 249 f9
....: &uacute;=(xfa) # 250 fa
....: &ucirc;=(xfb) # 251 fb
....: &uuml;=(xfc) # 252 fc
....: &yacute;=(xfd) # 253 fd
....: &thorn;=(xfe) # 254 fe
....: &eacute;=(xe9)
....: &ecirc;=(xea)
....: &euml;=(xeb)
....: &igrave;=(xec)
....: &iacute;=(xed)
....: &icirc;=(xee)
....: &iuml;=(xef)
....: '''
In [18]: print HTM_Decoder (test_string)
&oslash;=(xf8) # 248 f8
&ugrave;=(xf9) # 249 f9
&uacute;=(xfa) # 250 fa
&ucirc;=(xfb) # 251 fb
&uuml;=(xfc) # 252 fc
&yacute;=(xfd) # 253 fd
&thorn;=(xfe) # 254 fe
&eacute;=(xe9)
&ecirc;=(xea)
&euml;=(xeb)
&igrave;=(xec)
&iacute;=(xed)
&icirc;=(xee)
&iuml;=(xef)
In [19]:
Thanks,
Ray
Frederic Rentsch wrote:
> Rares Vernica wrote:
>> Hi,
>>
>> How can I unescape HTML entities like "&nbsp;"?
>>
>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
>> "&lt;", and "&gt;".
>>
>> Also, I know about htmlentitydefs.entitydefs, but not only is this
>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>
>> It has to be in python 2.4.
>>
>> Thanks a lot,
>> Ray
>>
> One way is this:
>
> >>> import SE  # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
> >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')  # HTM2ISO.se is included
> 'output_file_name'
>
> For repeated translations the SE object would be assigned to a variable:
>
> >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
>
> SE objects take and return strings as well as file names, which is useful
> for translating string variables, doing line-by-line translations, and
> for interactive development or verification. A simple way to check a
> substitution set is to use its definitions as test data. The following
> is a section of the definition file HTM2ISO.se:
>
> test_string = '''
> &oslash;=(xf8) # 248 f8
> &ugrave;=(xf9) # 249 f9
> &uacute;=(xfa) # 250 fa
> &ucirc;=(xfb) # 251 fb
> &uuml;=(xfc) # 252 fc
> &yacute;=(xfd) # 253 fd
> &thorn;=(xfe) # 254 fe
> &eacute;=(xe9)
> &ecirc;=(xea)
> &euml;=(xeb)
> &igrave;=(xec)
> &iacute;=(xed)
> &icirc;=(xee)
> &iuml;=(xef)
> '''
>
> >>> print HTM_Decoder (test_string)
>
> ø=(xf8) # 248 f8
> ù=(xf9) # 249 f9
> ú=(xfa) # 250 fa
> û=(xfb) # 251 fb
> ü=(xfc) # 252 fc
> ý=(xfd) # 253 fd
> þ=(xfe) # 254 fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
>
> Another feature of SE is modularity.
>
> >>> strip_tags = '''
> ~<(.|\x0a)*?>~=(9) # one tag to one tab
> ~<!--(.|\x0a)*?-->~=(9) # one comment to one tab
> | # run
> "~\x0a[ \x09\x0d\x0a]*~=(x0a)" # delete empty lines
> ~\t+~=(32) # one or more tabs to one space
> ~\x20\t+~=(32) # one space and one or more tabs to one space
> ~\t+\x20~=(32) # one or more tabs and one space to one space
> '''
>
> >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')  # Order doesn't matter
>
> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it
> together with HTM2ISO.se:
>
> >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')  # Order doesn't matter
>
> Or, if you have two SE objects, one for stripping tags and one for
> decoding the ampersands, you can nest them like this:
>
> >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;.&nbsp;</p>"
>
> >>> print Tag_Stripper (HTM_Decoder (test_string))
> René est un garçon qui paraît plus âgé.
>
> Nesting works with file names too, because file names are returned:
>
> >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
> 'output_file_name'
>
>
> Frederic
>
>
>
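For readers who do not have SE installed, the nested decode-then-strip idea quoted above can be approximated with the standard library alone. The following is only a sketch, not SE's implementation: the names decode_entities, strip_tags and sample are made up for illustration, and it is written for Python 3. On Python 2.4 the same table is available as htmlentitydefs.name2codepoint and chr() would be unichr(); on Python 3.4 and later, html.unescape() already does the decoding part.

```python
import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs.name2codepoint

# Matches decimal (&#233;), hex (&#xe9;) and named (&eacute;) references.
ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace HTML character references with the characters they name."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[0] == "#":
                if ref[1] in "xX":
                    code = int(ref[2:], 16)
                else:
                    code = int(ref[1:], 10)
            else:
                code = name2codepoint[ref]
            return chr(code)  # Python 2: unichr(code)
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the text untouched.
            return match.group(0)
    return ENTITY_RE.sub(replace, text)


# Comments first, then any remaining tag; each match becomes one space.
TAG_RE = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_tags(text):
    """Crude regex-based tag removal (hypothetical helper, not an HTML parser)."""
    text = TAG_RE.sub(" ", text)
    # Collapse runs of whitespace, including the no-break space from &nbsp;.
    return re.sub(r"\s+", " ", text).strip()


sample = "<p class=MsoNormal><i>Ren&eacute;</i> est un gar&ccedil;on.&nbsp;</p>"
print(strip_tags(decode_entities(sample)))
```

The call order mirrors the Tag_Stripper (HTM_Decoder (...)) nesting in the quoted message. One caveat of that order: escaped markup such as "&lt;b&gt;" is decoded into a real tag first and then stripped, so swap the two calls if escaped angle brackets should survive as text.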