Re: [ python-Patches-590682 ] New codecs: html, asciihtml
(I'm moving this to python-dev) On Sun, Aug 04, 2002 at 08:54:05AM -0700, noreply@sourceforge.net wrote:
Comment By: Martin v. Löwis (loewis) Date: 2002-08-04 17:54
I'm in favour of exposing this via a search function, for generated codec names, on top of PEP 293 (I would not like your codec to compete with the alternative mechanism). My dislike for the current patch also comes from the fact that it singles out ASCII, which the search function would not.
I find PEP 293 too complex while my solution is, admittedly, too simplistic. Some of my reservations about PEP 293:
- It overloads the meaning of the error handling argument in an unintuitive way. It gets to the point where it's much more than just error handling - it's actually extending the functionality of the codec.
- Why implement yet another name-based registry? There must be a simpler way to do it.
- Generating an exception for each character that isn't handled by simple lookup probably adds quite a lot of overhead.
- What are the use cases? Maybe a simple extension to charmap would be enough for all the practical cases?
In any case, I'd encourage you to contribute to the progress of PEP 293 first - this has been an issue for several years now, and I would be sorry if it would fail.
Me too. But if you really don't want it to be rejected you should try to find a way to make it simpler.
While you are waiting for PEP 293 to complete, please do consider cleaning up htmlentitydefs to provide mappings from and to Unicode characters.
No problem. The question is whether anyone depends on its current form. My proposed changes:
1. Use all lowercase entity names as keys.
2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")
In its current form I find htmlentitydefs.py pretty useless. Names in the input in arbitrary case will not match the MixedCase keys in the entitydefs dictionary and the decimal character reference isn't really more useful than the named entity reference.
Oren
Oren Tirosh <oren-py-d@hishome.net> writes:
It overloads the meaning of the error handling argument in an unintuitive way. It gets to the point where it's much more than just error handling - it's actually extending the functionality of the codec.
Isn't that precisely the meaning of "to handle"?
3 : to act on or perform a required function with regard to <handle the day's mail>
It produces a replacement text, just in the same way as "ignore" or "replace" produce replacement texts.
Why implement yet another name-based registry?
Namespaces are one honking great idea -- let's do more of those!
There must be a simpler way to do it.
Propose one.
What are the use cases? Maybe a simple extension to charmap would be enough for all the practical cases?
The primary use case is XML: how do you efficiently use XML charrefs? Notice that you can *not* use the charmap codec, since the underlying encoding may not be based on the charmap codec. In addition, it allows a more detailed analysis of an encoding error, as it exposes the string position where the error occurs. This makes it possible to determine a "best" encoding (i.e. the one that needs the fewest exceptions, or the one that has the longest sequences of same encodings).
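As an illustration of the two use cases named above, here is a minimal sketch written against the `codecs.register_error` interface as it exists in Python 3 today. The handler name "charref-count-demo" and the helper names are made up for this example, and none of it is taken from the patch under discussion.

```python
import codecs


def make_charref_handler(counter):
    """Build an error handler that emits XML character references and
    records how many characters needed one."""
    def handler(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        # exc.start/exc.end expose the position of the failing run of characters.
        bad = exc.object[exc.start:exc.end]
        counter.append(len(bad))
        replacement = "".join("&#%d;" % ord(ch) for ch in bad)
        # Return the replacement text and the position at which to resume encoding.
        return replacement, exc.end
    return handler


def encode_counting_charrefs(text, encoding):
    """Encode text, returning (encoded bytes, number of characters replaced)."""
    counter = []
    # Hypothetical handler name; re-registering simply replaces the old handler.
    codecs.register_error("charref-count-demo", make_charref_handler(counter))
    return text.encode(encoding, "charref-count-demo"), sum(counter)


def best_encoding(text, candidates):
    """Pick the candidate encoding that needs the fewest character references."""
    return min(candidates, key=lambda enc: encode_counting_charrefs(text, enc)[1])
```

With the text "Link\u00f6ping \u03b1", ASCII needs two references and Latin-1 only one, so `best_encoding` prefers Latin-1 among those two. For plain charref output without the counting, the built-in "xmlcharrefreplace" handler does the same job.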
Me too. But if you really don't want it to be rejected you should try to find a way to make it simpler.
Can you please elaborate why you think this is difficult? Is this a concern about
- the implementation of the PEP, or
- the implementation of error handlers, or
- the usage of error handlers?
I couldn't really believe that you find usage of this feature difficult: just pass an error handling string to your codec just as you currently do.
While you are waiting for PEP 293 to complete, please do consider cleaning up htmlentitydefs to provide mappings from and to Unicode characters.
No problem. The question is whether anyone depends on its current form. My proposed changes:
1. Use all lowercase entity names as keys.
That is probably a bad idea. At least for XHTML, the case of entity references is normative. Even for HTML 4, it would be good if this precisely matches the DTD. You could provide a case-insensitive lookup function in addition.
2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")
I think htmlentitydefs.entitydefs must stay as-is, for compatibility. Instead, I'd suggest adding additional objects/functions. Of course, the data should be present only once - all other functions/dictionaries could be derived.
In its current form I find htmlentitydefs.py pretty useless. Names in the input in arbitrary case will not match the MixedCase keys in the entitydefs dictionary and the decimal character reference isn't really more useful than the named entity reference.
Indeed. However, people probably rely on its specific contents, so any more useful access to the data must preserve entitydefs in its current form. Regards, Martin
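A rough sketch of the "case-insensitive lookup function in addition" idea, assuming the `name2codepoint` table that Python 3 ships in `html.entities`. The function name `lookup_entity` and its `ignore_case` parameter are invented for illustration. The case-sensitive table stays the single source of data, and the loose lookup is derived from it.

```python
from html.entities import name2codepoint


def lookup_entity(name, ignore_case=False):
    """Return the character for an entity name, or None if it is unknown.

    An exact, case-sensitive match always wins. The optional fallback
    ignores case, but only when that is unambiguous.
    """
    if name in name2codepoint:
        return chr(name2codepoint[name])
    if ignore_case:
        wanted = name.lower()
        matches = {cp for key, cp in name2codepoint.items()
                   if key.lower() == wanted}
        # "OUML" could mean either "Ouml" or "ouml", so refuse to guess.
        if len(matches) == 1:
            return chr(matches.pop())
    return None
```

Here "ouml" and "Ouml" keep their distinct meanings, "GT" resolves to ">" only when `ignore_case=True`, and "OUML" is rejected as ambiguous.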
Oren Tirosh wrote:
(I'm moving this to python-dev)
I've already answered on the SF tracker. Won't repeat things here. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/
Oren Tirosh wrote:
In its current form I find htmlentitydefs.py pretty useless.
I use it a lot, and find it reasonably useful. sure beats typing in the HTML character tables myself, or writing a DTD parser.
Names in the input in arbitrary case will not match the MixedCase keys in the entitydefs dictionary
people who use oddball characters may prefer to keep uppercase letters separate from lowercase letters. if I type "Linköping" using a named entity, I don't want it to come out as "LinkÖping". if you don't care, nothing stops you from using the "lower" string method.
and the decimal character reference isn't really more useful than the named entity reference.
really? converting a decimal character reference to a unicode character is trivial, but how do you convert a named entity reference to a unicode character? (look it up in the htmlentitydefs?)
here's a trivial piece of code that converts the entitydefs dictionary to an entity->unicode mapping:
    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            char = unichr(int(char[2:-1]))
        else:
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char
</F>
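For readers on Python 3, where `unichr` and `unicode` no longer exist, the same conversion might look roughly like the sketch below. It assumes a table in the 2002 format, modelled here with byte strings: each value is either a single Latin-1 byte or a decimal character reference. The function name and the three-entry sample table are illustrative only.

```python
def entitydefs_to_unicode(entitydefs):
    """Convert an old-style entity table to an entity -> character mapping."""
    result = {}
    for entity, raw in entitydefs.items():
        if raw[:2] == b"&#":
            # Decimal character reference: strip "&#" and ";", then convert.
            result[entity] = chr(int(raw[2:-1].decode("ascii")))
        else:
            # A single Latin-1 encoded character.
            result[entity] = raw.decode("iso-8859-1")
    return result


# Hypothetical sample in the old format, not the full table.
sample = {"alpha": b"&#945;", "ouml": b"\xf6", "Ouml": b"\xd6"}
converted = entitydefs_to_unicode(sample)
```

In current Python the standard library already provides this mapping as code points in `html.entities.name2codepoint`, so the loop is mainly of historical interest.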
On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
and the decimal character reference isn't really more useful than the named entity reference.
really? converting a decimal character reference to a unicode character is trivial, but how do you convert a named entity reference to a unicode character? (look it up in the htmlentitydefs?)
here's a trivial piece of code that converts the entitydefs dictionary to an entity->unicode mapping:
    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            char = unichr(int(char[2:-1]))
        else:
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char
Sure it's trivial but why should I be forced to do this conversion? I'm sorry if I didn't explain myself so well. What I meant is not that the entitydefs dictionary is useless but that decimal character references are not useful by themselves - they are just another intermediate form. Why does the dictionary convert from "&alpha;" to "&#945;" and not to the fully decoded form which is the single unicode character u'\u03b1'?
I can't think of a case where numeric references are really useful by themselves and not as some intermediate form. Browsers understand "&alpha;" and "&#945;" equally well. Humans find the named references easier to understand. Processing programs can't understand "&#945;" without first isolating the digits and converting them to a number.
About case sensitivity you're right - smashing case does lose some information. If a parser needs to understand sloppy manually-generated HTML with tags like "&GT;" it should be a little smarter than that.
Oren
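To make the "isolating the digits" step concrete, here is a small sketch, in Python 3, of a decoder that treats named and decimal references the same way and maps both straight to characters. The regular expression and the function name `decode_refs` are assumptions made for this example. It handles decimal references only and leaves unknown names untouched.

```python
import re
from html.entities import name2codepoint

# "&name;" or "&#digits;" (hexadecimal references are deliberately not covered).
_REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_refs(text):
    """Replace named and decimal character references with characters."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            # Numeric form: isolate the digits and convert them to a number.
            return chr(int(body[1:]))
        codepoint = name2codepoint.get(body)
        # Leave unknown names untouched rather than guessing.
        return chr(codepoint) if codepoint is not None else match.group(0)
    return _REFERENCE.sub(replace, text)
```

Both `decode_refs("&alpha;")` and `decode_refs("&#945;")` give the single character u'\u03b1'. In current Python, `html.unescape` covers the same ground more completely.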
Oren Tirosh wrote:
On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
and the decimal character reference isn't really more useful than the named entity reference.
really? converting a decimal character reference to a unicode character is trivial, but how do you convert a named entity reference to a unicode character? (look it up in the htmlentitydefs?)
here's a trivial piece of code that converts the entitydefs dictionary to an entity->unicode mapping:
    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            char = unichr(int(char[2:-1]))
        else:
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char
Sure it's trivial but why should I be forced to do this conversion?
Maybe because users of htmlentitydefs don't want to pay for the extra table even though they don't use it?
I'm sorry if I didn't explain myself so well. What I meant is not that the entitydefs dictionary is useless but that decimal character references are not useful by themselves - they are just another intermediate form. Why does the dictionary convert from "&alpha;" to "&#945;" and not to the fully decoded form which is the single unicode character u'\u03b1'?
Because that only works for Unicode and not all applications are written to work with Unicode. The table maps entities to Latin-1 which is HTML's default encoding.
I can't think of a case where numeric references are really useful by themselves and not as some intermediate form. Browsers understand "&alpha;" and "&#945;" equally well. Humans find the named references easier to understand. Processing programs can't understand "&#945;" without first isolating the digits and converting them to a number.
About case sensitivity you're right - smashing case does lose some information. If a parser needs to understand sloppy manually-generated HTML with tags like "&GT;" it should be a little smarter than that.
That is application specific. The htmlentitydefs were generated from the HTML spec files themselves, so they provide the basics needed to work from. It's easy enough for you to write a function which translates the basic table into anything you need.
-- Marc-Andre Lemburg
participants (4)
- Fredrik Lundh
- M.-A. Lemburg
- martin@v.loewis.de
- Oren Tirosh