unescape HTML entities

Frederic Rentsch anthra.norell at vtxmail.ch
Thu Nov 2 15:37:18 CET 2006


Rares Vernica wrote:
> Hi,
>
> I downloaded 2.2 beta, just to be sure I have the same version as you 
> specify. (The file names are no longer funny.) Anyway, it does not seem 
> to do as you said:
>
> In [14]: import SE
>
> In [15]: SE.version
> -------> SE.version()
> Out[15]: 'SE 2.2 beta - SEL 2.2 beta'
>
> In [16]: HTM_Decoder = SE.SE ('HTM2ISO.se')
>
> In [17]: test_string = '''
>     ....: &oslash;=(xf8)   #  248  f8
>     ....: &ugrave;=(xf9)   #  249  f9
>     ....: &uacute;=(xfa)   #  250  fa
>     ....: &ucirc;=(xfb)    #  251  fb
>     ....: &uuml;=(xfc)     #  252  fc
>     ....: &yacute;=(xfd)   #  253  fd
>     ....: &thorn;=(xfe)    #  254  fe
>     ....: &eacute;=(xe9)
>     ....: &ecirc;=(xea)
>     ....: &euml;=(xeb)
>     ....: &igrave;=(xec)
>     ....: &iacute;=(xed)
>     ....: &icirc;=(xee)
>     ....: &iuml;=(xef)
>     ....: '''
>
> In [18]: print HTM_Decoder (test_string)
>
> &oslash;=(xf8)   #  248  f8
> &ugrave;=(xf9)   #  249  f9
> &uacute;=(xfa)   #  250  fa
> &ucirc;=(xfb)    #  251  fb
> &uuml;=(xfc)     #  252  fc
> &yacute;=(xfd)   #  253  fd
> &thorn;=(xfe)    #  254  fe
> &eacute;=(xe9)
> &ecirc;=(xea)
> &euml;=(xeb)
> &igrave;=(xec)
> &iacute;=(xed)
> &icirc;=(xee)
> &iuml;=(xef)
>
>
> In [19]:
>
> Thanks,
> Ray
>
>
>
> Frederic Rentsch wrote:
>   
>> Rares Vernica wrote:
>>     
>>> Hi,
>>>
>>> How can I unescape HTML entities like "&nbsp;"?
>>>
>>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;", 
>>> "&lt;", and "&gt;".
>>>
>>> Also, I know about htmlentitydefs.entitydefs, but not only is this 
>>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>>
>>> It has to be in Python 2.4.
>>>
>>> Thanks a lot,
>>> Ray
>>>
>>>       
>> One way is this:
>>
>>  >>> import SE   # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>>  >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')   # HTM2ISO.se is included
>> 'output_file_name'
>>
>> For repeated translations the SE object would be assigned to a variable:
>>
>>  >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
>>
>> SE objects take and return strings as well as file names which is useful 
>> for translating string variables, doing line-by-line translations and 
>> for interactive development or verification. A simple way to check a 
>> substitution set is to use its definitions as test data. The following 
>> is a section of the definition file HTM2ISO.se:
>>
>> test_string = '''
>> &oslash;=(xf8)   #  248  f8
>> &ugrave;=(xf9)   #  249  f9
>> &uacute;=(xfa)   #  250  fa
>> &ucirc;=(xfb)    #  251  fb
>> &uuml;=(xfc)     #  252  fc
>> &yacute;=(xfd)   #  253  fd
>> &thorn;=(xfe)    #  254  fe
>> &eacute;=(xe9)
>> &ecirc;=(xea)
>> &euml;=(xeb)
>> &igrave;=(xec)
>> &iacute;=(xed)
>> &icirc;=(xee)
>> &iuml;=(xef)
>> '''
>>
>>  >>> print HTM_Decoder (test_string)
>>
>> ø=(xf8)   #  248  f8
>> ù=(xf9)   #  249  f9
>> ú=(xfa)   #  250  fa
>> û=(xfb)    #  251  fb
>> ü=(xfc)     #  252  fc
>> ý=(xfd)   #  253  fd
>> þ=(xfe)    #  254  fe
>> é=(xe9)
>> ê=(xea)
>> ë=(xeb)
>> ì=(xec)
>> í=(xed)
>> î=(xee)
>> ï=(xef)
>>
>> Another feature of SE is modularity.
>>
>>  >>> strip_tags = '''
>>    ~<(.|\x0a)*?>~=(9)               # one tag to one tab
>>    ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
>> |                                   # run
>>    "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
>>    ~\t+~=(32)                       # one or more tabs to one space
>>    ~\x20\t+~=(32)                   # one space and one or more tabs to one space
>>    ~\t+\x20~=(32)                   # one or more tabs and one space to one space
>> '''
>>
>>  >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
>>
>> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it 
>> together with HTM2ISO.se:
>>
>>  >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
>>
>> Or, if you have two SE objects, one for stripping tags and one for 
>> decoding the ampersands, you can nest them like this:
>>
>>  >>> test_string = ("<p class=MsoNormal style='line-height:110%'>"
>>  ...     "<i>Ren&eacute;</i> est un gar&ccedil;on qui "
>>  ...     "para&icirc;t plus &acirc;g&eacute;. </p>")
>>
>>  >>> print Tag_Stripper (HTM_Decoder (test_string))
>>   René est un garçon qui paraît plus âgé.
>>
>> Nesting works with file names too, because file names are returned:
>>
>>  >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
>> 'output_file_name'
>>
>>
>> Frederic
>>
>>
>>
>>     
>
>   


Ray,

I am sorry you're having a problem. I cannot duplicate it. It works fine 
here. I suspect that SE.SE doesn't find your file HTM2ISO.SE. Do this:

 >>> HTM_Decoder = SE.SE ('HTM2ISO.SE')
 >>> HTM_Decoder.show_log ()

Thu Nov 02 15:15:39 2006 - Compiler - Ignoring single word 'HTM2ISO.SE'. Not an existing file 'HTM2ISO.SE'.

If you see this, then you might have forgotten to include the path with 
the file name.
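
A quick way to rule this out before calling SE.SE is to check the path 
yourself. The following is only a sketch using os.path from the standard 
library; the helper name find_definition_file and its search_dirs argument 
are made up for illustration and are not part of SE.

```python
import os

def find_definition_file(name, search_dirs=()):
    """Return an absolute path to `name`, or None if it cannot be found.

    Hypothetical helper, not part of SE: it tries `name` as given
    (relative to the current working directory) and then inside each
    directory listed in `search_dirs`.
    """
    candidates = [name] + [os.path.join(d, name) for d in search_dirs]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None
```

If it returns None, the file is not where the interpreter is looking. 
Otherwise the returned path, rather than the bare file name, is what you 
would hand to SE.SE.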

Rather than getting an old version, you could just have renamed the two 
py-files. Version 2.3 has some minor bugs corrected. I fixed the names 
and tried to re-upload to the Cheese Shop, and the damn thing stubbornly 
refuses the upload after having required that I delete the file I was 
going to replace. So it isn't there anymore and the replacement isn't 
there yet. I'll be working on this. In the meantime I'll be happy to 
direct-mail V2.3 on request.
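
Until the upload is sorted out, here is a rough sketch that decodes named 
and numeric entities with the standard library alone. It is not how SE 
works internally. The names ENTITY, decode_entity and unescape_entities are 
made up for illustration, and the import fallbacks assume the entity table 
lives in htmlentitydefs on Python 2 and in html.entities on Python 3. I 
have only sketched the Python 2 side, so treat it as an untested assumption.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2 module name
except ImportError:
    from html.entities import name2codepoint    # Python 3 module name

try:
    unichr                                       # Python 2 built-in
except NameError:
    unichr = chr                                 # Python 3: chr handles Unicode

# &name;  or  &#123;  or  &#x7b;
ENTITY = re.compile(r'&(#[xX]?[0-9a-fA-F]+|\w+);')

def decode_entity(match):
    """Return the character for one matched entity, or the entity unchanged."""
    body = match.group(1)
    try:
        if body[:2] in ('#x', '#X'):
            return unichr(int(body[2:], 16))     # hexadecimal reference
        if body[:1] == '#':
            return unichr(int(body[1:]))         # decimal reference
        return unichr(name2codepoint[body])      # named entity
    except (KeyError, ValueError, OverflowError):
        return match.group(0)                    # unknown or malformed: keep as is

def unescape_entities(text):
    """Decode every entity in `text` in a single pass."""
    return ENTITY.sub(decode_entity, text)
```

With this, unescape_entities('Ren&eacute; est un gar&ccedil;on') gives 
'René est un garçon', and an unknown name such as '&bogus;' is left alone. 
Because the substitution is a single pass, '&amp;lt;' becomes '&lt;' and is 
not decoded a second time. On Python 3.4 and later, html.unescape in the 
standard library covers the same ground.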

Frederic





