[Patches] [ python-Patches-590682 ] New codecs: html, asciihtml
noreply@sourceforge.net
Thu, 12 Dec 2002 02:11:23 -0800
Patches item #590682, was opened at 2002-08-04 06:58
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Oren Tirosh (orenti)
Assigned to: M.-A. Lemburg (lemburg)
Summary: New codecs: html, asciihtml
Initial Comment:
These codecs translate HTML character &entity;
references.
The html codec may be applied after other codecs such
as utf-8 or iso8859_X and preserves their encoding. The
asciihtml encoder produces 7-bit ascii and its output is
therefore safe for insertion into almost any document
regardless of its encoding.
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2002-12-12 11:11
Message:
Logged In: YES
user_id=21627
Oren, is this patch still needed, as we now have the
xmlcharrefreplace error handler?
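For reference, the xmlcharrefreplace error handler from PEP 293 replaces any character the target codec cannot represent with a numeric character reference; a minimal sketch in modern Python:

```python
# PEP 293's xmlcharrefreplace error handler: characters that the
# target codec (here ASCII) cannot represent are replaced with
# numeric character references.
text = "caf\u00e9 \u2014 r\u00e9sum\u00e9"
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#8212; r&#233;sum&#233;'
```

Note that it leaves ASCII characters such as <, >, and & untouched, which is exactly the gap Oren's codec is meant to cover.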
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-09 17:38
Message:
Logged In: YES
user_id=562624
Case insensitivity fixed. General cleanup. Codecs renamed to
htmlescape and htmlescape8bit. Improved error handling.
Updated unicode_test.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-05 14:11
Message:
Logged In: YES
user_id=562624
Yes, entities are supposed to be case sensitive but I'm
working with manually-generated html in which &GT; is not so
uncommon... I guess life is different in the XML world.
Case-smashing loses the distinction between some entities. I
guess I need a more intelligent solution.
> If you apply it to an 8-bit UTF-8 encoded string you'll get
garbage!
Actually, it works great. The html codec passes characters
128-255 unmodified and therefore can be chained with other
codecs. But I now have a more elegant and high-performance
approach than codec chaining. See my python-dev posting.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2002-08-05 09:59
Message:
Logged In: YES
user_id=38388
On the htmlentitydefs: yes, these are in use as they are
defined now. If you want a mapping from and to Unicode,
I'd suggest providing this as a new table. About the cased
keys in the entitydefs dict: AFAIK, these have to be cased since
entities are case-sensitive. Could be wrong though.
On PEP 293: this is going in the final round now. Your patch
doesn't compete with it though, since PEP 293 is a much more
general approach.
On the general idea: I think the codecs are misnamed. They
should be called htmlescape and asciihtmlescape since they
don't provide "real" HTML encoding/decoding, as Martin
already mentioned.
There's something wrong with your approach, BTW: the codec
should only operate on Unicode (taking only Unicode input
and generating Unicode). If you apply it to an 8-bit
UTF-8 encoded string you'll get garbage!
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 17:54
Message:
Logged In: YES
user_id=21627
I'm in favour of exposing this via a search function for
generated codec names, on top of PEP 293 (I would not like
your codec to compete with the alternative mechanism). My
dislike for the current patch also comes from the fact that
it singles out ASCII, which the search function would not.
You could implement two forms: html.codecname and
xml.codecname. The html form would do HTML entity references
in both directions, and fall back to character references
only if necessary; the XML form would use character
references all the time, and entity references only for the
builtin entities.
And yes, I do recommend users to use codecs.charmap_encode
directly, as this is probably the most efficient, yet most
compact way to convert Unicode to a less-than-7-bit form.
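A minimal sketch of the charmap_encode approach Martin describes, in modern Python (the identity map over ASCII is illustrative, not from the patch):

```python
import codecs

# Identity map over 7-bit ASCII, minus the markup-significant
# characters, so that encoding any of <, >, & raises an error.
charmap = {i: i for i in range(128)}
for ch in "<>&":
    del charmap[ord(ch)]

encoded, consumed = codecs.charmap_encode("safe text", "strict", charmap)
print(encoded)  # b'safe text'

try:
    codecs.charmap_encode("a < b", "strict", charmap)
except UnicodeEncodeError as exc:
    print("rejected:", exc.object[exc.start])  # rejected: <
```

Pairing such a map with a replacing error handler is then what turns the rejected characters into entity references.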
In any case, I'd encourage you to contribute to the progress
of PEP 293 first - this has been an issue for several years
now, and I would be sorry if it failed.
While you are waiting for PEP 293 to complete, please do
consider cleaning up htmlentitydefs to provide mappings from
and to Unicode characters.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 17:07
Message:
Logged In: YES
user_id=562624
>People may be tricked into believing that they can
>decode arbitrary HTML with your codec - when your
>codec would incorrectly deal with CDATA sections.
You don't even need to go as far as CDATA to see that tags
must be parsed first and only then tag bodies and attribute
values can be individually decoded. If you do it in the reverse
order the tag parser will try to parse a decoded &lt; as a tag. It
should be documented, though.
For encoding it's also obvious that encoding must be done
first and then the encoded strings can be inserted into tags -
< in strings is encoded into &lt;, preventing it from being
interpreted as a tag. This is a good thing! It prevents insertion
attacks.
> You can easily enough arrange to get errors on <, >,
> and &, by using codecs.charmap_encode with an
> appropriate encoding map.
If you mean to use this as some internal implementation
detail it's OK. Are you actually proposing that this is the
way end users should use it?
How about this:
Install an encoder registry function that responds to any
codec name matching "xmlcharref.SPAM" and does all the
internal magic you describe to create a codec instance that
combines xmlcharref translation (including <, >, and &) with
the SPAM encoding. This dynamically-generated codec will do
both encoding and decoding and be cached, of course.
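This registry idea can be sketched with codecs.register; everything below (the "xmlcharref." prefix, the helper names) is hypothetical illustration, not code from the patch:

```python
import codecs

def _xmlcharref_search(name):
    # Hypothetical naming scheme from the proposal: "xmlcharref.<codec>"
    if not name.startswith("xmlcharref."):
        return None
    base = name[len("xmlcharref."):]

    def encode(input, errors="strict"):
        # Escape markup-significant characters first (& before < and >
        # to avoid double-escaping), then apply the base codec with
        # xmlcharrefreplace for anything it cannot represent.
        escaped = (input.replace("&", "&amp;")
                        .replace("<", "&lt;")
                        .replace(">", "&gt;"))
        return escaped.encode(base, "xmlcharrefreplace"), len(input)

    def decode(input, errors="strict"):
        raise NotImplementedError("decoding sketch omitted")

    return codecs.CodecInfo(encode=encode, decode=decode, name=name)

codecs.register(_xmlcharref_search)
data, consumed = codecs.lookup("xmlcharref.ascii").encode("x < caf\u00e9")
print(data)  # b'x &lt; caf&#233;'
```

The search function is only consulted for names the built-in codecs don't resolve, so the "xmlcharref." namespace coexists with the ordinary registry.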
"Namespaces are one honking great idea -- let's do more of
those!"
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 13:50
Message:
Logged In: YES
user_id=21627
You can easily enough arrange to get errors on <, >, and &,
by using codecs.charmap_encode with an appropriate encoding map.
In fact, with that, you can easily get all entity references
into the encoded data, without any need for an explicit
iteration.
However, I am concerned that you offer decoding as well.
People may be tricked into believing that they can decode
arbitrary HTML with your codec - when your codec would
incorrectly deal with CDATA sections.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 13:10
Message:
Logged In: YES
user_id=562624
PEP 293 and patch #432401 are not a replacement for these
codecs - this patch does decoding as well as encoding, and also
translates <, >, and &, which are valid in all encodings and
therefore won't get translated by error callbacks.
----------------------------------------------------------------------
Comment By: Oren Tirosh (orenti)
Date: 2002-08-04 13:00
Message:
Logged In: YES
user_id=562624
Yes, the error callback approach handles strange mixes
better than my method of chaining codecs. But it only does
encoding - this patch also provides full decoding of named,
decimal and hexadecimal character entity references.
Assuming PEP 293 is accepted, I'd like to see the asciihtml
codec stay for its decoding ability and renamed to xmlcharref.
The encoding part of this codec can just call .encode("ascii",
errors="xmlcharrefreplace") to make it a full two-way codec.
I'd prefer htmlentitydefs.py to use unicode, too. It's not so
useful the way it is. Another problem is that it uses mixed
case names as keys. The dictionary lookup is likely to miss
incoming entities with arbitrary case, so it's more or less
broken. Does anyone actually use it the way it is? Can it be
changed to use unicode without breaking anyone's code?
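The decoding side described here (named, decimal, and hexadecimal references) can be sketched as follows; html.entities.name2codepoint is the modern, Unicode-based descendant of the htmlentitydefs table this thread discusses:

```python
import re
from html.entities import name2codepoint  # modern home of the entity table

_ref = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));")

def unescape(text):
    # Resolve decimal (&#233;), hexadecimal (&#xE9;) and named
    # (&eacute;) character references; unknown names pass through.
    def repl(m):
        dec, hexa, name = m.groups()
        if dec:
            return chr(int(dec))
        if hexa:
            return chr(int(hexa, 16))
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return m.group(0)
    return _ref.sub(repl, text)

print(unescape("caf&eacute; &#38; &#x3C;tea&gt;"))  # café & <tea>
```

The name lookup is case-sensitive, so &Eacute; and &eacute; resolve to different characters - the very distinction Oren worries about losing by case-smashing.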
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-08-04 10:54
Message:
Logged In: YES
user_id=21627
This patch is superseded by PEP 293 and patch #432401, which
allows you to write
unitext.encode("ascii", errors = "xmlcharrefreplace")
This probably should be left open until PEP 293 is
pronounced upon, and then either rejected or reviewed in detail.
I'd encourage a patch that uses Unicode in htmlentitydefs
directly, and computes entitydefs from that, instead of
vice versa (or at least exposes a unicode_entitydefs, perhaps
even lazily) - perhaps also with a reverse mapping.
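What Martin asks for here is essentially what later appeared in the standard library: in modern Python, html.entities carries name2codepoint and the reverse codepoint2name, from which both directions can be derived. A sketch:

```python
from html.entities import name2codepoint

# Forward table: entity name -> Unicode character.
unicode_entitydefs = {name: chr(cp) for name, cp in name2codepoint.items()}

# Reverse table: Unicode character -> entity name.
reverse_entitydefs = {ch: name for name, ch in unicode_entitydefs.items()}

print(unicode_entitydefs["copy"])    # ©
print(reverse_entitydefs["\u00a9"])  # copy
```

Computing the tables with comprehensions like this is cheap enough that the lazy variant Martin mentions is rarely needed.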
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=590682&group_id=5470