[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

Oren Tirosh oren-py-d@hishome.net
Sun, 4 Aug 2002 21:30:46 +0300

(I'm moving this to python-dev)

On Sun, Aug 04, 2002 at 08:54:05AM -0700, noreply@sourceforge.net wrote:
> >Comment By: Martin v. Löwis (loewis)
> Date: 2002-08-04 17:54
> I'm in favour of exposing this via a search function, for
> generated codec names, on top of PEP 293 (I would not like
> your codec to compete with the alternative mechanism). My
> dislike for the current patch also comes from the fact that
> it singles out ASCII, which the search function would not.

I find PEP 293 too complex while my solution is, admittedly, too 

Some of my reservations about PEP 293:

It overloads the meaning of the error handling argument in an unintuitive
way.  It gets to the point where it's much more than just error handling - 
it's actually extending the functionality of the codec. 
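
To make the point concrete, here is a minimal sketch of an error handler that
does real encoding work.  It assumes the callback API as it exists in current
Python (codecs.register_error), which postdates this message; the handler name
"htmlentityreplace" and the function html_entity_replace are made up for
illustration.

```python
import codecs
from html.entities import codepoint2name


def html_entity_replace(exc):
    """Hypothetical handler: turn unencodable characters into HTML references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    # exc.object[exc.start:exc.end] is the run of characters the codec rejected.
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity, fall back to a decimal character reference.
        pieces.append("&%s;" % name if name else "&#%d;" % ord(ch))
    # Return the replacement text and the position where encoding resumes.
    return "".join(pieces), exc.end


# The handler is selected by name through the "errors" argument.
codecs.register_error("htmlentityreplace", html_entity_replace)

encoded = "caf\u00e9 \u20ac5 \u4e2d".encode("ascii", "htmlentityreplace")
```

Nothing here is an error in the usual sense: the "errors" argument is what
chooses the output format.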

Why implement yet another name-based registry?  There must be a simpler way 
to do it.

Generating an exception for each character that isn't handled by simple 
lookup probably adds quite a lot of overhead.

What are the use cases?  Maybe a simple extension to charmap would be enough 
for all the practical cases?
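
As a rough illustration of what a lookup-only approach could look like, the
sketch below uses str.translate from current Python with a table whose values
are whole strings.  It is not the charmap codec itself, and the names
EntityTable, ENTITY_TABLE and encode_html_ascii are assumptions made for the
example.

```python
from html.entities import codepoint2name


class EntityTable(dict):
    """Hypothetical lookup table mapping code points to replacement strings."""

    def __missing__(self, codepoint):
        if codepoint < 128:
            # A LookupError tells str.translate to keep the character as is.
            raise KeyError(codepoint)
        # Non-ASCII characters without a named entity get a decimal reference.
        return "&#%d;" % codepoint


# Named entities for non-ASCII characters only; plain ASCII passes through.
ENTITY_TABLE = EntityTable(
    (cp, "&%s;" % name) for cp, name in codepoint2name.items() if cp >= 128
)


def encode_html_ascii(text):
    """Encode text to ASCII bytes using table lookup alone, no error callback."""
    return text.translate(ENTITY_TABLE).encode("ascii")
```

Note that this sketch leaves "&", "<" and ">" untouched; escaping markup
characters is a separate concern.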

> In any case, I'd encourage you to contribute to the progress
> of PEP 293 first - this has been an issue for several years
> now, and I would be sorry if it would fail.

Me too.  But if you really don't want it to be rejected you should try to
find a way to make it simpler.

> While you are waiting for PEP 293 to complete, please do
> consider cleaning up htmlentitydefs to provide mappings from
> and to Unicode characters.

No problem.  The question is whether anyone depends on its current form.  
My proposed changes:

1. Use all lowercase entity names as keys.
2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")

In its current form I find htmlentitydefs.py pretty useless. Names in the
input in arbitrary case will not match the MixedCase keys in the entitydefs 
dictionary and the decimal character reference isn't really more useful than 
the named entity reference.
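
A minimal sketch of proposed change 2, written against current Python, where
the module lives at html.entities and exposes name2codepoint.  The names
ENTITY_TO_CHAR and replace_named_entities are made up for the example.  The
sketch keeps the keys in their original case, because names such as "Eacute"
and "eacute" differ only by case and map to different characters.

```python
import re
from html.entities import name2codepoint

# Map each entity name directly to its Unicode character.
ENTITY_TO_CHAR = {name: chr(cp) for name, cp in name2codepoint.items()}

# Matches named references such as "&eacute;" (not numeric ones).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_named_entities(text):
    """Replace known named entity references; leave unknown ones untouched."""
    def substitute(match):
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(substitute, text)
```

With a name-to-character table the lookup result can be used directly, with no
second step to parse a "&#nnnn;" string.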