[Python-Dev] forwarded message from aahz@panix.com

Thu, 5 Oct 2000 16:13:34 -0400 (EDT)

>>>>> "AM" == aahz  <aahz@panix.com> writes:

  >> [Aahz]
  >>> Someone just pointed out on c.l.py that we need an HTMLescape()
  >>> function that takes a string and converts special characters to
  >>> entities.  I'm not on python-dev, so could you please forward
  >>> this and find out whether I need to run a PEP?
  >>
  >> Has someone pointed out yet that this is done by cgi.escape()?

  AM> Yeah, I missed that earlier.  But after thinking some more,
  AM> there are a fair number of browser-like bits of software that
  AM> fail to render many of the special characters correctly
  AM> (e.g. trademark).  This is frequently due to character set
  AM> issues; entities almost always render correctly, though.
  AM> Therefore a general translation routine is probably handy.

  AM> cgi.escape() only handles "&", "<", ">".  I'm not sure whether
  AM> cgi.escape ought to be expanded to handle all characters or a
  AM> new routine should be added.  Martin von Loewis suggested
  AM> xml.sax.saxutils.escape(), but I have zero familiarity with XML
  AM> and am waiting for 2.0final.  Perhaps this should be taken
  AM> off-line?

Perhaps we should take it to python-list -- or maybe we should form a
web-sig and work on it there.

There are definitely some tricky issues to work out.  I attempted to
work out some of the same issues for internationalization support in
Mailman's pipermail archives.

The escape function in cgi should stay minimal, because it deals with
the only truly essential characters.  If the page is encoded as
iso-8859-1 ("Latin 1") and the browser interprets it that way, then
characters > chr(127) are going to be rendered properly.  You can
declare the charset with an explicit meta tag in the HTML page, or
have the server return it in the Content-Type header.
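
As a rough sketch of the charset declaration described above (the
function name latin1_page is made up for illustration, and the header
list is just a plain list of tuples, not any particular server API),
something like this declares the charset in both places and encodes
the body to match:

```python
def latin1_page(body_text):
    """Return (headers, payload) for a page declared and encoded as ISO-8859-1.

    Hypothetical helper: body_text is assumed to be already escaped
    for "&", "<" and ">".
    """
    charset = "iso-8859-1"
    html_doc = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=%s">'
        "</head><body>%s</body></html>" % (charset, body_text)
    )
    # The same charset goes into the HTTP header and the meta tag.
    headers = [("Content-Type", "text/html; charset=%s" % charset)]
    # Raises UnicodeEncodeError for characters outside Latin 1.
    return headers, html_doc.encode(charset)
```

Note that a character such as the trademark sign is not in Latin 1 at
all, so the encode step would fail for it; that case still needs an
entity or a different charset.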

This seems quite a bit simpler than trying to escape all characters >
chr(127), unless you have to deal with old browsers that don't
support the charset specified by the HTTP header.
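
For comparison, here is a minimal sketch of the two escaping levels
under discussion.  It assumes a modern Python, where html.escape()
stands in for cgi.escape() (which was later removed from the standard
library); the function names escape_minimal and escape_to_ascii are
made up for illustration.  The second function leans on the
"xmlcharrefreplace" codec error handler to turn everything outside
ASCII into numeric character references.

```python
import html


def escape_minimal(text):
    """Escape only "&", "<" and ">", like the minimal cgi-style escape."""
    return html.escape(text, quote=False)


def escape_to_ascii(text):
    """Escape markup characters, then replace every non-ASCII character
    with a numeric character reference such as &#8482;."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

With these, escape_to_ascii("Acme\u2122 <b>") should give
"Acme&#8482; &lt;b&gt;", which an old browser can render without
knowing anything about the page's charset.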

Jeremy