[Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml
Martin v. Loewis
04 Aug 2002 21:30:06 +0200
Oren Tirosh <email@example.com> writes:
> It overloads the meaning of the error handling argument in an
> unintuitive way. It gets to the point where it's much more than
> just error handling - it's actually extending the functionality of
> the codec.
Isn't that precisely the meaning of "to handle"?
3 : to act on or perform a required function with regard to
<handle the day's mail>
It produces a replacement text, just in the same way as "ignore" or
"replace" produce replacement texts.
> Why implement yet another name-based registry?
Namespaces are one honking great idea -- let's do more of those!
> There must be a simpler way to do it.
> What are the use cases? Maybe a simple extension to charmap would
> be enough for all the practical cases?
The primary use case is XML: how do you efficiently use XML charrefs?
Notice that you can *not* use the charmap codec, since the underlying
encoding may not be based on the charmap codec.
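As a rough illustration of that use case, here is a minimal sketch in
Python 3 syntax. It assumes an interpreter in which the
"xmlcharrefreplace" handler name proposed in PEP 293 is available; the
sample string is made up.

```python
# Sketch only: assumes the "xmlcharrefreplace" handler from PEP 293 exists.
text = "caf\u00e9 \u20ac5"  # made-up sample: e-acute and a euro sign

# The same handler name works whatever the target encoding is; only the
# characters that encoding cannot represent become character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
latin1_bytes = text.encode("iso-8859-1", "xmlcharrefreplace")

print(ascii_bytes)
print(latin1_bytes)
```

With ASCII both non-ASCII characters are replaced; with ISO-8859-1 only
the euro sign is, since e-acute is encodable there.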
In addition, it allows a more detailed analysis of an encoding
error, as it exposes the string position where the error occurs. This
makes it possible to determine a "best" encoding (i.e. one that needs the
fewest exceptions, or the one that has the longest sequences of
> Me too. But if you really don't want it to be rejected you should
> try to find a way to make it simpler.
Can you please elaborate why you think this is difficult? Is this a
- the implementation of the PEP, or
- the implementation of error handlers, or
- the usage of error handlers?
I couldn't really believe that you find usage of this feature
difficult: just pass an error handling string to your codec just as
you currently do.
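To make the usage concrete, here is a small sketch of a custom handler,
assuming the codecs.register_error() interface described in PEP 293. The
handler name "log_and_charref" and the positions list are hypothetical,
chosen only for this example.

```python
import codecs

positions = []  # hypothetical log of (start, end) spans that failed to encode

def log_and_charref(exc):
    """Record where encoding failed, then substitute character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    positions.append((exc.start, exc.end))
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the position to resume encoding at.
    return replacement, exc.end

# Register under a name; callers select the handler by passing that string.
codecs.register_error("log_and_charref", log_and_charref)

encoded = "a\u20acb\u00e9".encode("ascii", "log_and_charref")
```

The caller's side is unchanged: the only difference from "replace" or
"ignore" is the string passed as the errors argument.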
> > While you are waiting for PEP 293 to complete, please do
> > consider cleaning up htmlentitydefs to provide mappings from
> > and to Unicode characters.
> No problem. The question is whether anyone depends on its current form.
> My proposed changes:
> 1. Use all lowercase entity names as keys.
That is probably a bad idea. At least for XHTML, the case of entity
references is normative. Even for HTML 4, it would be good if this
precisely matches the DTD.
You could provide a case-insensitive lookup function in addition.
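One possible shape for such a function, as a sketch only: it uses
html.entities.name2codepoint (html.entities is the Python 3 name of
htmlentitydefs), and lookup_entity is a hypothetical name. Because names
such as "Aacute" and "aacute" differ only in case, an exact match is
tried first and the case-insensitive fallback is used only when it is
unambiguous.

```python
from html.entities import name2codepoint

# Group the case-sensitive entity names by their lowercase form.
_by_lower = {}
for _name in name2codepoint:
    _by_lower.setdefault(_name.lower(), []).append(_name)

def lookup_entity(name):
    """Return the code point for an entity name, or None.

    An exact-case match wins; otherwise a case-insensitive match is
    accepted only if exactly one entity has that lowercase spelling.
    """
    if name in name2codepoint:
        return name2codepoint[name]
    candidates = _by_lower.get(name.lower(), [])
    if len(candidates) == 1:
        return name2codepoint[candidates[0]]
    return None
```

The DTD-exact keys stay untouched, and sloppy input such as "NBSP" still
resolves, while an ambiguous spelling such as "AACUTE" is rejected
rather than guessed.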
> 2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")
I think htmlentitydefs.entitydefs must stay as-is, for
compatibility. Instead, I'd suggest adding additional
objects/functions. Of course, the data should be present only once -
all other functions/dictionaries could be derived.
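A sketch of what "present only once" could look like. All names here
are hypothetical, the table is a four-entry excerpt, and the legacy
value format ("&#nnnn;") is assumed from the description quoted in this
thread rather than checked against the module.

```python
# Hypothetical single source of truth: entity name -> code point (excerpt).
NAME2CODEPOINT = {"amp": 0x26, "eacute": 0xE9, "Eacute": 0xC9, "euro": 0x20AC}

# Derived: entity name -> Unicode character.
name2unichr = {name: chr(cp) for name, cp in NAME2CODEPOINT.items()}

# Derived: code point -> entity name (reverse lookup, useful for encoding).
codepoint2name = {cp: name for name, cp in NAME2CODEPOINT.items()}

# Derived: legacy-style mapping kept for compatibility, assumed here to
# map each name to a decimal character reference.
legacy_entitydefs = {name: "&#%d;" % cp for name, cp in NAME2CODEPOINT.items()}
```

Existing callers keep reading the legacy dictionary, while new code can
use the Unicode mappings; none of the derived objects duplicates the data.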
> In its current form I find htmlentitydefs.py pretty useless. Names in the
> input in arbitrary case will not match the MixedCase keys in the entitydefs
> dictionary and the decimal character reference isn't really more useful than
> the named entity reference.
Indeed. However, people probably rely on its specific contents, so any
more useful access to the data must preserve entitydefs in its current