Has there been any thought of extending escape() / unescape() to handle, by default, characters other than <, >, and &? At a minimum they should handle arbitrary &xxx; values. Ideally, they would also handle other common symbolic names besides &lt;, &gt;, etc.
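
For reference, here is a small sketch of how I understand the current xml.sax.saxutils behaviour: only the three characters are handled by default, and anything else has to be passed in through the optional entities mapping. The sample strings are my own, just for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() only touches &, < and > unless told otherwise.
escaped = escape("a < b & c")

# unescape() leaves other named references alone by default...
untouched = unescape("C&ocirc;te")

# ...unless the caller supplies an explicit mapping for them.
decoded = unescape("C&ocirc;te", {"&ocirc;": "\u00f4"})

print(escaped)
print(untouched)
print(decoded)
```

So the hook is there, but every caller has to bring their own table.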

HTML from common web sites such as nytimes.com frequently has a variety of characters escaped. 

Consider the page at http://travel.nytimes.com/travel/guides/europe/france/provence-and-the-french-riviera/overview.html

It lists its content type as:
content="text/html; charset=UTF-8"
And contains text like:
There&#146;s the C&ocirc;te d&#146;
Ideally, we would decode &#146; into ’ and &ocirc; into ô.
Unfortunately, &#146; is really an error -- 146 is not the Unicode code point for an apostrophe but the MS codepage 1252 byte value for one (apparently many HTML editing systems intermingle Unicode and codepage 1252 content for apostrophes and a few other common characters).
I'm happy to contribute some additional code for these other cases if people agree it's useful.
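
To make the proposal concrete, here is a rough sketch of the kind of decoding I have in mind. It is not a patch: the name unescape_html and the helper _decode_numeric are made up for illustration, and I am assuming a Python where the entity table is importable as html.entities.name2codepoint (it lives in htmlentitydefs on 2.x). Treating numeric references in the 0x80-0x9F range as codepage 1252 is a heuristic on my part, not something any spec of escape() requires.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs on Python 2.x

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _decode_numeric(codepoint):
    # Assumption: 0x80-0x9F are C1 control characters in Unicode, so a page
    # that uses them most likely meant the codepage 1252 character instead.
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            return chr(codepoint)  # byte value not assigned in cp1252
    return chr(codepoint)


def unescape_html(text):
    """Hypothetical helper: expand named and numeric character references."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return _decode_numeric(int(body[2:], 16))
            if body.startswith("#"):
                return _decode_numeric(int(body[1:]))
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            return match.group(0)  # leave unknown references untouched
    return _ENTITY_RE.sub(replace, text)


sample = "There&#146;s the C&ocirc;te d&#146;"
print(unescape_html(sample))
```

The substitution is a single pass, so something like &amp;lt; comes out as the literal text &lt; rather than being decoded twice, and unknown names are passed through unchanged instead of raising.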



On May 12, 2008, at 10:36 AM, Tony Nelson wrote:

> At 11:56 PM -0400 5/10/08, Fred Drake wrote:
>> On May 10, 2008, at 11:49 PM, Guido van Rossum wrote:
>>> Works for me. The other thing I always use from cgi is escape() --
>>> will that be available somewhere else too?
>>
>> xml.sax.saxutils.escape() would be an appropriate replacement, though
>> the location is a little funky.
>
> At least it's right next to the valuable quoteattr().
> --
> ____________________________________________________________________
> TonyN.:'                       <mailto:tonynelson@georgeanelson.com>
>      '                              <http://www.georgeanelson.com/>