[I18n-sig] Proposal: Extended error handling for unicode.encode

"Walter Dörwald" walter@livinglogic.de
Wed, 03 Jan 2001 20:18:58 +0100


On 22.12.00 at 19:15 M.-A. Lemburg wrote:

> "Walter Dörwald" wrote:
> > 
> > On 21.12.00 at 18:30 M.-A. Lemburg wrote:
> > > [about state in encoders and error handlers]
> > But I don't see how this internal encoder state should influence
> > what the error handler does. There are two layers involved: The
> > character encoding layer and the "unencodable character escape
> > mechanism". Both layers are completely independent, even in your
> > "Unicode compression" example, where the "unencodable character
> > escape mechanism" is XML character entities.
> 
> This is true for your XML entity escape example, but error
> resolving in general will likely need to know about the
> current state of the encoder, e.g. to be able to write data
> corresponding page in the Unicode compression example or to
> force a switch of the current page to a different one.

What does this "Unicode compression example" look like?

> I know that error handling could be more generic, but passing
> a callable object instead of the error parameter is not an
> option since the internal APIs all use a const char parameter
> for error.

Changing this could be done in one or two hours by someone
who knows the Python internals. (Unfortunately I don't; I first
looked at unicodeobject.[hc] several days ago!)

> Besides, I consider such an approach a hack and not
> a solution.
> 
> Instead of trying to tweak the implementation into providing
> some kind of new error scheme, let's focus on finding a generic
> framework which could provide a solution for the general case
> while not breaking the existing applications.

Are the existing codecs (JapaneseCodecs etc.) to be considered part
of the existing applications?

The problem might be how to handle callbacks to C functions and
callbacks to Python functions in a consistent way. I.e. is it
extern DL_IMPORT(PyObject*) PyUnicode_Encode(
     const Py_UNICODE *s,        /* Unicode char buffer */
     int size,                   /* number of Py_UNICODE chars to encode */
     const char *encoding,       /* encoding */
     PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
     );
or
extern DL_IMPORT(PyObject*) PyUnicode_Encode(
     const Py_UNICODE *s,        /* Unicode char buffer */
     int size,                   /* number of Py_UNICODE chars to encode */
     const char *encoding,       /* encoding */
     PyObject *errorHandler /* error handling via Python function */
     );
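
To illustrate the second variant, here is a minimal Python sketch of a
handler with the (string, position) signature from the prototype above.
Both encode_with_handler and xml_charref are hypothetical names, not
existing APIs, and encoding one character at a time only works for
stateless encodings such as ASCII or Latin-1.

```python
def xml_charref(string, position):
    """Hypothetical handler: replacement text for string[position]."""
    return "&#%d;" % ord(string[position])


def encode_with_handler(string, encoding, error_handler):
    """Toy model of the proposed behaviour (not an existing API).

    Assumes a stateless encoding and that the replacement text
    returned by the handler is itself encodable.
    """
    out = []
    for position, char in enumerate(string):
        try:
            out.append(char.encode(encoding))
        except UnicodeEncodeError:
            # Ask the handler what to write instead of the bad character.
            replacement = error_handler(string, position)
            out.append(replacement.encode(encoding))
    return b"".join(out)


result = encode_with_handler("a\u20acb", "ascii", xml_charref)
```

Any callable with this signature would do, so a wrapped C function and a
plain Python function could be passed the same way.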

> > > Writing your own function helpers which then apply all the necessary
> > > magic is simple and doesn't warrant changing APIs in the core.
> > 
> > It is not as simple as the error handler, but I could live with that.
> > 
> > The big problem is that it effectively kills the speed of your
> > application. Every XML application written in Python, no matter
> > what it does internally, will in the end have to produce an output
> > bytestring. Normally the output encoding should be one that produces
> > no unencodable characters, but you have to be prepared to handle
> > them. With the error handler the complete encoding will be done
> > in C code (with very infrequent calls to the error handler), so
> > this scheme gives the best speed possible.
> 
> It would give even better performance if the codec would provide
> this hook in some way at C level.

extern DL_IMPORT(PyObject*) PyUnicode_Encode(
     const Py_UNICODE *s,        /* Unicode char buffer */
     int size,                   /* number of Py_UNICODE chars to encode */
     const char *encoding,       /* encoding */
     PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
     );
would, but that's not the point. When you use an encoding where more
than 20% of the characters have to be escaped (as XML entities or whatever),
you're using the wrong encoding.
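
A rough Python model of why the handler is called so rarely: the
encodable runs go through the built-in encoder in one piece, and the
callback only sees the positions that fail. encode_runs and xml_charref
are hypothetical names. The sketch assumes a stateless encoding and an
encodable replacement, and it relies on the start attribute of
UnicodeEncodeError to locate the failing character.

```python
def xml_charref(string, position):
    """Hypothetical handler: replacement text for string[position]."""
    return "&#%d;" % ord(string[position])


def encode_runs(string, encoding, error_handler):
    """Encode whole runs at once; call the handler only on failures.

    Returns the encoded bytes and the number of handler calls.
    """
    out = []
    calls = 0
    pos = 0
    while pos < len(string):
        try:
            # Fast path: the rest of the string encodes in one go.
            out.append(string[pos:].encode(encoding))
            break
        except UnicodeEncodeError as exc:
            # exc.start is relative to the slice that was being encoded.
            bad = pos + exc.start
            out.append(string[pos:bad].encode(encoding))
            out.append(error_handler(string, bad).encode(encoding))
            calls += 1
            pos = bad + 1
    return b"".join(out), calls


text = "x" * 1000 + "\u20ac" + "y" * 1000
encoded, handler_calls = encode_runs(text, "ascii", xml_charref)
```

For this input the handler runs once for 2001 characters; everything
else stays in the built-in encoder.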

> Note that almost all codecs
> have their own error handlers written in C already.
>
> > > Since the error handling is extensible by adding new options
> > > such as 'callback',
> > 
> > I would prefer a more object oriented way of extending the error
> > handling.
> 
> Sure, but we have to assure backward compatibility as well.
>  
> > > the existing codecs could be extended to
> > > provide this functionality as well. We'd only need a way to
> > > pass the callback to the codecs in some way, e.g. by using
> > > a keyword argument on the constructor or by subclassing it
> > > and providing a new method for the error handling in question.
> > 
> > There is no need for a string argument 'callback' and
> > an additional callback function/method that is passed to the
> > encoder. When the error argument is a string, the old mechanism
> > can be used, when it is a callable object the new will be used.
> 
> This is bad style and also gives problems in the core 
> implementation (have a look at unicodeobject.c).

I did. What is the problem with changing "const char *error" to
"PyObject *error"?

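A minimal sketch of the string-or-callable dispatch, written in Python
for brevity: a string selects one of the old named behaviours, anything
callable is used directly. resolve_error_handler and the three helper
functions are hypothetical, and the (string, position) signature is the
one assumed earlier in this mail.

```python
def _strict(string, position):
    """Old 'strict' behaviour: refuse to encode."""
    raise ValueError("cannot encode character at position %d" % position)


def _ignore(string, position):
    """Old 'ignore' behaviour: drop the character."""
    return ""


def _replace(string, position):
    """Old 'replace' behaviour: substitute a question mark."""
    return "?"


_NAMED_HANDLERS = {"strict": _strict, "ignore": _ignore, "replace": _replace}


def resolve_error_handler(errors):
    """Map the errors argument to a callable (hypothetical helper)."""
    if callable(errors):
        # New mechanism: the caller supplied the handler itself.
        return errors
    try:
        # Old mechanism: look the handler up by name.
        return _NAMED_HANDLERS[errors]
    except KeyError:
        raise ValueError("unknown error handling: %r" % (errors,))
```

Existing callers that pass 'strict', 'ignore' or 'replace' would see no
change under this scheme.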

Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de