[Patches] [ python-Patches-590682 ] New codecs: html, asciihtml
noreply@sourceforge.net
Thu, 12 Dec 2002 02:11:23 -0800
Patches item #590682, was opened at 2002-08-04 06:58
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Oren Tirosh (orenti)
Assigned to: M.-A. Lemburg (lemburg)
Summary: New codecs: html, asciihtml
Initial Comment:
These codecs translate HTML character &entity;
references.
The html codec may be applied after other codecs such
as utf-8 or iso8859_X and preserves their encoding. The
asciihtml encoder produces 7-bit ascii and its output is
therefore safe for insertion into almost any document
regardless of its encoding.
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2002-12-12 11:11
Message:
Logged In: YES
user_id=21627
Oren, is this patch still needed, as we now have the
xmlcharrefreplace error handler?
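For reference, the xmlcharrefreplace error handler from PEP 293 replaces any character the target codec cannot represent with a numeric character reference; a minimal sketch in modern Python:

```python
# PEP 293's xmlcharrefreplace error handler: characters that the
# target codec (here ASCII) cannot represent are replaced with
# numeric character references.
text = "caf\u00e9 \u2014 r\u00e9sum\u00e9"
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#8212; r&#233;sum&#233;'
```

Note that it leaves ASCII characters such as <, >, and & untouched, which is exactly the gap Oren's codec is meant to cover.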
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-09 17:38
Message:
Logged In: YES
user_id=562624
Case insensitivity fixed. General cleanup. Codecs renamed to
htmlescape and htmlescape8bit. Improved error handling.
Updated unicode_test.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-05 14:11
Message:
Logged In: YES
user_id=562624
Yes, entities are supposed to be case sensitive but I'm
working with manually-generated html in which &GT; is not so
uncommon... I guess life is different in the XML world.
Case-smashing loses the distinction between some entities. I
guess I need a more intelligent solution.
> If you apply it to an 8-bit UTF-8 encoded string you'll get
garbage!
Actually, it works great. The html codec passes characters
128-255 unmodified and therefore can be chained with other
codecs. But I now have a more elegant and high-performance
approach than codec chaining. See my python-dev posting.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2002-08-05 09:59
Message:
Logged In: YES
user_id=38388
On the htmlentitydefs: yes, these are in use as they are
defined now. If you want a mapping from and to Unicode,
I'd suggest providing this as a new table. About the cased
keys in the entitydefs dict: AFAIK, these have to be cased since
entities are case-sensitive. Could be wrong though.
On PEP 293: this is going in the final round now. Your patch
doesn't compete with it though, since PEP 293 is a much more
general approach.
On the general idea: I think the codecs are misnamed. They
should be called htmlescape and asciihtmlescape since they
don't provide "real" HTML encoding/decoding, as Martin
already mentioned.
There's something wrong with your approach, BTW: the codec
should only operate on Unicode (taking only Unicode input
and generating Unicode). If you apply it to an 8-bit
UTF-8 encoded string you'll get garbage!
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 17:54
Message:
Logged In: YES
user_id=21627
I'm in favour of exposing this via a search function for
generated codec names, on top of PEP 293 (I would not like
your codec to compete with the alternative mechanism). My
dislike for the current patch also comes from the fact that
it singles out ASCII, which the search function would not.
You could implement two forms: html.codecname and
xml.codecname. The html form would do HTML entity references
in both directions, and fall back to character references
only if necessary; the XML form would use character
references all the time, and entity references only for the
builtin entities.
And yes, I do recommend users to use codecs.charmap_encode
directly, as this is probably the most efficient, yet most
compact way to convert Unicode to a less-than-7-bit form.
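A minimal sketch of the charmap_encode approach Martin describes, in modern Python (the identity map over ASCII is illustrative, not from the patch):

```python
import codecs

# Identity map over 7-bit ASCII, minus the markup-significant
# characters, so that encoding any of <, >, & raises an error.
charmap = {i: i for i in range(128)}
for ch in "<>&":
    del charmap[ord(ch)]

encoded, consumed = codecs.charmap_encode("safe text", "strict", charmap)
print(encoded)  # b'safe text'

try:
    codecs.charmap_encode("a < b", "strict", charmap)
except UnicodeEncodeError as exc:
    print("rejected:", exc.object[exc.start])  # rejected: <
```

Pairing such a map with a replacing error handler is then what turns the rejected characters into entity references.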
In any case, I'd encourage you to contribute to the progress
of PEP 293 first - this has been an issue for several years
now, and I would be sorry if it failed.
While you are waiting for PEP 293 to complete, please do
consider cleaning up htmlentitydefs to provide mappings from
and to Unicode characters.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 17:07
Message:
Logged In: YES
user_id=562624
>People may be tricked into believing that they can
>decode arbitrary HTML with your codec - when your
>codec would incorrectly deal with CDATA sections.
You don't even need to go as far as CDATA to see that tags
must be parsed first and only then tag bodies and attribute
values can be individually decoded. If you do it in the reverse
order the tag parser will try to parse a decoded &lt; as a tag. It
should be documented, though.
For encoding it's also obvious that encoding must be done
first and then the encoded strings can be inserted into tags -
< in strings is encoded into &lt;, preventing it from being
interpreted as a tag. This is a good thing! It prevents insertion
attacks.
> You can easily enough arrange to get errors on <, >,
> and &, by using codecs.charmap_encode with an
> appropriate encoding map.
If you mean to use this as some internal implementation
detail it's OK. Are you actually proposing that this is the
way end users should use it?
How about this:
Install an encoder registry function that responds to any
codec name matching "xmlcharref.SPAM" and does all the
internal magic you describe to create a codec instance that
combines xmlcharref translation (including <, >, and &) with
the SPAM encoding. This dynamically-generated codec will do
both encoding and decoding and be cached, of course.
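This registry idea can be sketched with codecs.register; everything below (the "xmlcharref." prefix, the helper names) is hypothetical illustration, not code from the patch:

```python
import codecs

def _xmlcharref_search(name):
    # Hypothetical naming scheme from the proposal: "xmlcharref.<codec>"
    if not name.startswith("xmlcharref."):
        return None
    base = name[len("xmlcharref."):]

    def encode(input, errors="strict"):
        # Escape markup-significant characters first (& before < and >
        # to avoid double-escaping), then apply the base codec with
        # xmlcharrefreplace for anything it cannot represent.
        escaped = (input.replace("&", "&amp;")
                        .replace("<", "&lt;")
                        .replace(">", "&gt;"))
        return escaped.encode(base, "xmlcharrefreplace"), len(input)

    def decode(input, errors="strict"):
        raise NotImplementedError("decoding sketch omitted")

    return codecs.CodecInfo(encode=encode, decode=decode, name=name)

codecs.register(_xmlcharref_search)
data, consumed = codecs.lookup("xmlcharref.ascii").encode("x < caf\u00e9")
print(data)  # b'x &lt; caf&#233;'
```

The search function is only consulted for names the built-in codecs don't resolve, so the "xmlcharref." namespace coexists with the ordinary registry.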
"Namespaces are one honking great idea -- let's do more of
those!"
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 13:50
Message:
Logged In: YES
user_id=21627
You can easily enough arrange to get errors on <, >, and &,
by using codecs.charmap_encode with an appropriate encoding map.
In fact, with that, you can easily get all entity references
into the encoded data, without any need for an explicit
iteration.
However, I am concerned that you offer decoding as well.
People may be tricked into believing that they can decode
arbitrary HTML with your codec - when your codec would
incorrectly deal with CDATA sections.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 13:10
Message:
Logged In: YES
user_id=562624
PEP 293 and patch #432401 are not a replacement for these
codecs - this patch does decoding as well as encoding, and also
translates <, >, and &, which are valid in all encodings and
therefore won't get translated by error callbacks.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 13:00
Message:
Logged In: YES
user_id=562624
Yes, the error callback approach handles strange mixes
better than my method of chaining codecs. But it only does
encoding - this patch also provides full decoding of named,
decimal and hexadecimal character entity references.
Assuming PEP 293 is accepted, I'd like to see the asciihtml
codec stay for its decoding ability and renamed to xmlcharref.
The encoding part of this codec can just call .encode("ascii",
errors="xmlcharrefreplace") to make it a full two-way codec.
I'd prefer htmlentitydefs.py to use unicode, too. It's not so
useful the way it is. Another problem is that it uses mixed
case names as keys. The dictionary lookup is likely to miss
incoming entities with arbitrary case, so it's more or less
broken. Does anyone actually use it the way it is? Can it be
changed to use unicode without breaking anyone's code?
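The decoding side described here (named, decimal, and hexadecimal references) can be sketched as follows; html.entities.name2codepoint is the modern, Unicode-based descendant of the htmlentitydefs table this thread discusses:

```python
import re
from html.entities import name2codepoint  # modern home of the entity table

_ref = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));")

def unescape(text):
    # Resolve decimal (&#233;), hexadecimal (&#xE9;) and named
    # (&eacute;) character references; unknown names pass through.
    def repl(m):
        dec, hexa, name = m.groups()
        if dec:
            return chr(int(dec))
        if hexa:
            return chr(int(hexa, 16))
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return m.group(0)
    return _ref.sub(repl, text)

print(unescape("caf&eacute; &#38; &#x3C;tea&gt;"))  # café & <tea>
```

The name lookup is case-sensitive, so &Eacute; and &eacute; resolve to different characters - the very distinction Oren worries about losing by case-smashing.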
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 10:54
Message:
Logged In: YES
user_id=21627
This patch is superseded by PEP 293 and patch #432401, which
allows you to write
unitext.encode("ascii", errors = "xmlcharrefreplace")
This probably should be left open until PEP 293 is
pronounced upon, and then either rejected or reviewed in detail.
I'd encourage a patch that uses Unicode in htmlentitydefs
directly, and computes entitydefs from that, instead of
vice versa (or at least exposes a unicode_entitydefs, perhaps
even lazily) - perhaps also with a reverse mapping.
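What Martin asks for here is essentially what later appeared in the standard library: in modern Python, html.entities carries name2codepoint and the reverse codepoint2name, from which both directions can be derived. A sketch:

```python
from html.entities import name2codepoint

# Forward table: entity name -> Unicode character.
unicode_entitydefs = {name: chr(cp) for name, cp in name2codepoint.items()}

# Reverse table: Unicode character -> entity name.
reverse_entitydefs = {ch: name for name, ch in unicode_entitydefs.items()}

print(unicode_entitydefs["copy"])    # ©
print(reverse_entitydefs["\u00a9"])  # copy
```

Computing the tables with comprehensions like this is cheap enough that the lazy variant Martin mentions is rarely needed.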
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470