[Patches] [ python-Patches-432401 ] unicode encoding error callbacks

Tue, 12 Jun 2001 12:18:57 -0700

Patches item #432401, was updated on 2001-06-12 06:43
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=432401&group_id=5470

Category: core (C code)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: M.-A. Lemburg (lemburg)
Summary: unicode encoding error callbacks

Initial Comment:
This patch adds unicode error handling callbacks to the
encode functionality. With this patch it's possible to
not only pass 'strict', 'ignore' or 'replace' as the
errors argument to encode, but also a callable
function, that will be called with the encoding name,
the original unicode object and the position of the
unencodable character. The callback must return a
replacement unicode object that will be encoded instead
of the original character.

For example replacing unencodable characters with XML
character references can be done in the following way.

u"aäoöuüß".encode(
   "ascii",
   lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos])
)

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 12:18

Message:
Logged In: YES 
user_id=89016

One additional note: It is vital that errors is an
assignable attribute of the StreamWriter. 

Consider the XML example: For writing an XML DOM tree one
StreamWriter object is used. When a text node is written,
the error handling has to be set to
codecs.xmlreplace_encode_errors, but inside a comment or
processing instruction replacing unencodable characters with
charrefs is not possible, so here codecs.raise_encode_errors
should be used (or better a custom error handler that raises
an error that says "sorry, you can't have unencodable
characters inside a comment")

BTW, should we continue the discussion in the i18n SIG
mailing list? An email program is much more comfortable than
a HTML textarea! ;)

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 11:59

Message:
Logged In: YES 
user_id=89016

How the callbacks work:

A PyObject * named errors is passed in. This may by NULL,
Py_None, 'strict', u'strict', 'ignore', u'ignore',
'replace', u'replace' or a callable object.
PyCodec_EncodeHandlerForObject maps all of these objects to
one of the three builtin error callbacks
PyCodec_RaiseEncodeErrors (raises an exception),
PyCodec_IgnoreEncodeErrors (returns an empty replacement
string, in effect ignoring the error),
PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
replacement character to signify to the encoder that it
should choose a suitable replacement character) or directly
returns errors if it is a callable object. When an
unencodable character is encounterd the error handling
callback will be called with the encoding name, the original
unicode object and the error position and must return a
unicode object that will be encoded instead of the offending
character (or the callback may of course raise an
exception). U+FFFD characters in the replacement string will 
be replaced with a character that the encoder chooses ('?'
in all cases).

The implementation of the loop through the string is done in
the following way. A stack with two strings is kept and the
loop always encodes a character from the string at the
stacktop. If an error is encountered and the stack has only
one entry (during encoding of the original string) the
callback is called and the unicode object returned is pushed
on the stack, so the encoding continues with the replacement
string. If the stack has two entries when an error is
encountered, the replacement string itself has an
unencodable character and a normal exception raised. When
the encoder has reached the end of it's current string there
are two possibilities: when the stack contains two entries,
this was the replacement string, so the replacement string
will be poppep from the stack and encoding continues with
the next character from the original string. If the stack
had only one entry, encoding is finished.

(I hope that's enough explanation of the API and implementation)

I have renamed the static ...121 function to all lowercase
names.

BTW, I guess PyUnicode_EncodeUnicodeEscape could be
reimplemented as PyUnicode_EncodeASCII with a \uxxxx
replacement callback.

PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
PyCodec_ReplaceEncodeErrors are globally visible because
they have to be available in _codecsmodule.c to wrap them as
Python function objects, but they can't be implemented in
_codecsmodule, because they need to be available to the
encoders in unicodeobject.c (through
PyCodec_EncodeHandlerForObject), but importing the codecs
module might result in an endless recursion, because
importing a module requires unpickling of the bytecode,
which might require decoding utf8, which ... (but this will
only happen, if we implement the same mechanism for the
decoding API)

I have not touched PyUnicode_TranslateCharmap yet, 
should this function also support error callbacks? Why would
one want the insert None into the mapping to call the callback?

A remaining problem is how to implement decoding error
callbacks. In Python 2.1 encoding and decoding errors are
handled in the same way with a string value. But with
callbacks it doesn't make sense to use the same callback for
encoding and decoding (like codecs.StreamReaderWriter and
codecs.StreamRecoder do). Decoding callbacks have a
different API. Which arguments should be passed to the
decoding callback, and what is the decoding callback
supposed to do?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-12 11:00

Message:
Logged In: YES 
user_id=38388

About the Py_UNICODE*data, int size APIs:
Ok, point taken.

In general, I think we ought to keep the callback feature as
open as possible, so passing in pointers and sizes would not
be very useful.

BTW, could you summarize how the callback works in a few
lines ?

About _Encode121: I'd name this _EncodeUCS1 since that's
what it is ;-)

About the new functions: I was referring to the new static
functions which you gave PyUnicode_... names. If these are
not supposed to turn into non-static functions, I'd rather
have them use lower case names (since that's how the Python
internals work too -- most of the times).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 09:56

Message:
Logged In: YES 
user_id=89016

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE*data, int size style arguments
> --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as Unicode object.

Another problem is, that the callback requires a Python
object, so in the PyObject *version, the refcount is
incref'd and the object is passed to the callback. The
Py_UNICODE*/int version would have to create a new Unicode
object from the data.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 09:32

Message:
Logged In: YES 
user_id=89016

> * please don't place more than one C statement on one line
> like in:
> """
> +               unicode = unicode2; unicodepos =
> unicode2pos;
> +               unicode2 = NULL; unicode2pos = 0;
> """

OK, done!

> * Comments should start with a capital letter and be
> prepended
> to the section they apply to

Fixed!

> * There should be spaces between arguments in compares
> (a == b) not (a==b)

Fixed!

> * Where does the name "...Encode121" originate ?

encode one-to-one, it implements both ASCII and latin-1
encoding.

> * module internal APIs should use lower case names (you
> converted some of these to  PyUnicode_...() -- this is
> normally reserved for APIs which are either marked as
> potential candidates for the public API or are very
> prominent in the code)

Which ones? I introduced a new function for every old one,
that had a "const char *errors" argument, and a few new ones
in codecs.h, of those PyCodec_EncodeHandlerForObject is
vital, because it is used to map for old string arguments to
the new function objects. PyCodec_RaiseEncodeErrors can be
used in the encoder implementation to raise an encode error,
but it could be made static in unicodeobject.h so only those
encoders implemented there have access to it.

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE*data, int size style arguments > --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as Unicode object.

I look through the code and found no situation where the
Py_UNICODE*/int version is really used and having two
(PyObject *)s (the original and the replacement string),
instead of UNICODE*/int and PyObject * made the
implementation a little easier, but I can fix that.

> Please separate the errors.c patch from this patch -- it
> seems totally unrelated to Unicode.

PyCodec_RaiseEncodeErrors uses this the have a \Uxxxx with
four hex digits. I removed it.

I'll upload a revised patch as soon as it's done.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-12 07:29

Message:
Logged In: YES 
user_id=38388

Thanks for the patch -- it looks very impressive !.

I'll give it a try later this week. 

Some first cosmetic tidbits:
* please don't place more than one C statement on one line
like in:
"""
+               unicode = unicode2; unicodepos =
unicode2pos;
+               unicode2 = NULL; unicode2pos = 0;
"""

* Comments should start with a capital letter and be
prepended
to the section they apply to

* There should be spaces between arguments in compares
(a == b) not (a==b)

* Where does the name "...Encode121" originate ?

* module internal APIs should use lower case names (you
converted some of these to  PyUnicode_...() -- this is
normally reserved for APIs which are either marked as
potential candidates for the public API or are very
prominent in the code)

One thing which I don't like about your API change is that
you removed the Py_UNICODE*data, int size style arguments --
this makes it impossible to use the new APIs on non-Python
data or data which is not available as Unicode object.

Please separate the errors.c patch from this patch -- it
seems totally unrelated to Unicode.

Thanks.

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=432401&group_id=5470