[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml

04 Aug 2002 21:30:06 +0200

Oren Tirosh <oren-py-d@hishome.net> writes:

> It overloads the meaning of the error handling argument in an
> unintuitive way.  It gets to the point where it's much more than
> just error handling - it's actually extending the functionality of
> the codec.

Isn't that precisely the meaning of "to handle"?

3 : to act on or perform a required function with regard to 
   <handle the day's mail>

It produces a replacement text, just in the same way as "ignore" or
"replace" produce replacement texts.

> Why implement yet another name-based registry?  

Namespaces are one honking great idea -- let's do more of those!

> There must be a simpler way to do it.

Propose one.

> What are the use cases?  Maybe a simple extension to charmap would
> be enough for all the practical cases?

The primary use case is XML: how do you efficiently use XML charrefs?
Notice that you can *not* use the charmap codec, since the underlying
encoding may not be based on the charmap codec.
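
A minimal sketch of that use case, assuming the interface in the form
that later shipped in Python, where the handler is available under the
name "xmlcharrefreplace". The helper to_xml_bytes is a hypothetical
name used only for illustration. Shift_JIS is chosen because it is a
multi-byte encoding that is not implemented on top of charmap.

```python
def to_xml_bytes(text, encoding):
    """Encode text, replacing unencodable characters with XML charrefs.

    Hypothetical helper: the error handler is selected purely by name,
    so the same call works for any codec, charmap-based or not.
    """
    return text.encode(encoding, "xmlcharrefreplace")


# U+00E9 is not ASCII, so it becomes the decimal charref &#233;
ascii_bytes = to_xml_bytes("caf\u00e9", "ascii")

# U+0905 (a Devanagari letter) has no Shift_JIS representation; the
# Japanese characters are encoded natively, the rest as a charref.
sjis_bytes = to_xml_bytes("\u65e5\u672c \u0905", "shift_jis")
```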

In addition, it allows a more detailed analysis of an encoding error,
as it exposes the string position where the error occurs. This makes
it possible to determine a "best" encoding (i.e. the one that needs
the fewest exceptions, or the one that has the longest runs of
characters in the same encoding).
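
As a rough sketch of the "fewest exceptions" variant, assuming the
callback interface as it later shipped in Python
(codecs.register_error, with start/end/object attributes on the
exception). The handler name "example.count" and the functions
count_failures and best_encoding are hypothetical.

```python
import codecs


def count_failures(text, encoding):
    """Count the characters of text that encoding cannot represent."""
    failures = []

    def record(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        # exc.start and exc.end delimit the unencodable run in exc.object
        failures.append((exc.start, exc.end))
        # Emit nothing for the run and resume encoding right after it
        return ("", exc.end)

    # Hypothetical handler name; re-registering replaces the old entry
    codecs.register_error("example.count", record)
    text.encode(encoding, "example.count")
    return sum(end - start for start, end in failures)


def best_encoding(text, candidates):
    """Pick the candidate with the fewest unencodable characters."""
    return min(candidates, key=lambda enc: count_failures(text, enc))
```

Ties go to whichever candidate is listed first, so the order of the
list expresses a preference.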

> Me too.  But if you really don't want it to be rejected you should
> try to find a way to make it simpler.

Can you please elaborate why you think this is difficult? Is this a
concern about 
- the implementation of the PEP, or
- the implementation of error handlers, or
- the usage of error handlers?

I can't really believe that you find the usage of this feature
difficult: just pass an error handling string to your codec, just as
you currently do.
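
To illustrate, here is a sketch assuming the registry as it later
shipped in Python (codecs.register_error) and the Python 3 module
html.entities. The handler name "example.htmlentity" and the function
named_entity_replace are made up for this example. The call site looks
the same whether the name refers to a built-in or a registered handler.

```python
import codecs
from html.entities import codepoint2name


def named_entity_replace(exc):
    """Replace unencodable characters with named or numeric references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity; fall back to a decimal charref
        parts.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return ("".join(parts), exc.end)


# Hypothetical handler name, chosen for this sketch only
codecs.register_error("example.htmlentity", named_entity_replace)

text = "caf\u00e9 \u2603"
ignored = text.encode("ascii", "ignore")              # built-in handler
replaced = text.encode("ascii", "replace")            # built-in handler
entities = text.encode("ascii", "example.htmlentity") # registered handler
```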

> 
> > While you are waiting for PEP 293 to complete, please do
> > consider cleaning up htmlentitydefs to provide mappings from
> > and to Unicode characters.
> 
> No problem.  The question is whether anyone depends on its current form.  
> My proposed changes:
> 
> 1. Use all lowercase entity names as keys.

That is probably a bad idea. At least for XHTML, the case of entity
references is normative. Even for HTML 4, it would be good if this
precisely matches the DTD.

You could provide a case-insensitive lookup function in addition.
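
One possible shape for such a function, sketched against the Python 3
module html.entities rather than htmlentitydefs. The name
lookup_entity and its fallback rule are assumptions: an exact match
wins, and a case-insensitive match is accepted only if it is
unambiguous, since names such as "Eacute" and "eacute" differ only in
case.

```python
from html.entities import name2codepoint

# Index the case-sensitive names by their lowercased form
_lower_index = {}
for _name in name2codepoint:
    _lower_index.setdefault(_name.lower(), []).append(_name)


def lookup_entity(name):
    """Return the character for an entity name, or None if unknown.

    Hypothetical helper: tries the exact, DTD-cased name first and
    only then falls back to a case-insensitive match.
    """
    if name in name2codepoint:
        return chr(name2codepoint[name])
    candidates = _lower_index.get(name.lower(), [])
    if len(candidates) == 1:
        return chr(name2codepoint[candidates[0]])
    return None  # unknown, or ambiguous once case is ignored
```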

> 2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")

I think htmlentitydefs.entitydefs must stay as-is, for
compatibility. Instead, I'd suggest adding additional
objects/functions. Of course, the data should be present only once -
all other functions/dictionaries could be derived.

> In its current form I find htmlentitydefs.py pretty useless. Names in the
> input in arbitrary case will not match the MixedCase keys in the entitydefs 
> dictionary and the decimal character reference isn't really more useful than 
> the named entity reference. 

Indeed. However, people probably rely on its specific contents, so any
more useful access to the data must preserve entitydefs in its current
form.

Regards,
Martin