[I18n-sig] Proposal: Extended error handling for unicode.encode

"Walter Dörwald" walter@livinglogic.de
Mon, 08 Jan 2001 19:25:15 +0100


On 03.01.01 at 21:17 M.-A. Lemburg wrote:

> [ Unicode compression example ]
> 
> > > I know that error handling could be more generic, but passing
> > > a callable object instead of the error parameter is not an
> > > option since the internal APIs all use a const char parameter
> > > for error.
> > 
> > Changing this can be done in one or two hours for someone
> > who knows the Python internals. (Unfortunately I don't, I first
> > looked at unicodeobject.[hc] several days ago!)
> 
> Sure, but it would break code and alter the Python C API
> in unacceptable ways. Note that all builtin C codecs would
> also have to be changed.
> 
> If we are going to extend the error handling mechanism then
> we'd better do it some b/w compatible way, e.g. by providing
> new APIs.

But I don't think that can be done in a completely backward
compatible way. At least the codecs will have to be changed.

> [...]
>
> > extern DL_IMPORT(PyObject*) PyUnicode_Encode(
> >      const Py_UNICODE *s,        /* Unicode char buffer */
> >      int size,                   /* number of Py_UNICODE chars to encode */
> >      const char *encoding,       /* encoding */
> >      PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
> >      );
> > would, but that's not the point. When you use an encoding where more
> > than 20% of the characters have to be escaped (as XML entities or whatever),
> > you're using the wrong encoding.
> 
> That's what I was talking about all along... if it's really
> only for escaping XML, then a special Latin-1 or ASCII XML escaping
> codec would go a long way (without the troubles of using callbacks
> and without having to add a new error callback mechanism).

But I would like to have an escaping mechanism that can
be used with any encoding, not just latin1 + xml-escape,
and ascii + xml-escape, but also shift-jis + xml-escape,
euc + xml-escape, koi8 + xml-escape, ...
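
A minimal sketch of this idea: one escaping callback, written once,
combined with several encodings. It assumes a registry of named error
callbacks such as codecs.register_error, which later Python versions
provide and which is not part of the API discussed in this thread. The
handler name "sketch-xml-escape" is made up for illustration.

```python
import codecs

def xml_escape(exc):
    """Replace unencodable characters with decimal character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(c) for c in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end

# "sketch-xml-escape" is a hypothetical name chosen for this example.
codecs.register_error("sketch-xml-escape", xml_escape)

text = "Gr\u00fc\u00dfe \u3042"
for encoding in ("ascii", "latin-1", "koi8-r", "shift_jis", "euc_jp"):
    # The same callback serves every codec; only the set of
    # characters that need escaping differs per encoding.
    print(encoding, text.encode(encoding, "sketch-xml-escape"))
```

Each codec only reports which characters it cannot encode; what happens
to them is decided in one place outside the codecs.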

> Writing such a codec doesn't take much time, since the
> code's already there. Even better: XML escaping could be added
> as new error handling option, e.g. "xml-escape" instead of
> "replace".
> Since XML escaping is general enough, I do think that adding
> such an option to all builtin codecs would be an acceptable
> and workable solution.

But that method has two problems: Handling "xml-escape" has to 
be implemented in every codec and it only solves one problem: 
escaping via numeric (decimal) XML character entities.

What if I want an output where "ß" is escaped as "&szlig;"
and not "&#223;"?

And maybe I define my own entities, so that "あ"
will be written as "&hiraA;"?
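
As a sketch of how an application could do this itself, assuming the
same kind of callback registry (codecs.register_error in later Python
versions, not something defined in this thread): the table
CUSTOM_ENTITIES and the handler name "sketch-entity-escape" are
hypothetical, and html.entities.codepoint2name supplies the standard
HTML entity names.

```python
import codecs
from html.entities import codepoint2name

# Hypothetical application-defined entities, checked before the HTML ones.
CUSTOM_ENTITIES = {0x3042: "hiraA"}

def entity_escape(exc):
    """Prefer named entities; fall back to decimal character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for c in exc.object[exc.start:exc.end]:
        cp = ord(c)
        name = CUSTOM_ENTITIES.get(cp) or codepoint2name.get(cp)
        parts.append("&%s;" % name if name else "&#%d;" % cp)
    return "".join(parts), exc.end

codecs.register_error("sketch-entity-escape", entity_escape)

escaped = "Stra\u00dfe \u3042 \u0416".encode("ascii", "sketch-entity-escape")
print(escaped)
```

Changing the entity table needs no change to any codec.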

Another use case is when such a string is written to the terminal
(encoded with sys.getdefaultencoding()):
I want to highlight the character entities, so I have to
put ANSI escape sequences around the escaped character.
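
A sketch of that case, again assuming a callback registry like
codecs.register_error and codecs.lookup_error from later Python
versions (neither exists in the API under discussion). It wraps an
existing handler instead of reimplementing the escaping; the name
"sketch-ansi-xml" and the helper highlight() are made up here.

```python
import codecs

BOLD, RESET = "\x1b[1m", "\x1b[0m"  # ANSI SGR: bold on, all attributes off

def highlight(inner_name):
    """Build a handler that wraps another handler's output in ANSI bold."""
    inner = codecs.lookup_error(inner_name)
    def handler(exc):
        replacement, pos = inner(exc)
        return BOLD + replacement + RESET, pos
    return handler

# Reuse the built-in numeric escaping and only add the highlighting.
codecs.register_error("sketch-ansi-xml", highlight("xmlcharrefreplace"))

shown = "Stra\u00dfe".encode("ascii", "sketch-ansi-xml")
print(shown)
```

The escape sequences are plain ASCII, so the wrapped replacement can be
encoded by any ASCII-compatible codec.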

Implementing all of this in all the codecs would be a lot of work
and it is definitely nothing that should be part of the codecs
because it is too application specific.

> [...]


Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de