[Patches] [ python-Patches-432401 ] unicode encoding error callbacks

noreply@sourceforge.net
Wed, 06 Mar 2002 17:29:47 -0800


Patches item #432401, was opened at 2001-06-12 15:43
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=432401&group_id=5470

Category: Core (C code)
Group: None
Status: Open
Resolution: Postponed
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: M.-A. Lemburg (lemburg)
Summary: unicode encoding error callbacks

Initial Comment:
This patch adds unicode error handling callbacks to the
encode functionality. With this patch it is possible to
pass not only 'strict', 'ignore' or 'replace' as the
errors argument to encode(), but also a callable that
will be called with the encoding name, the original
unicode object and the position of the unencodable
character. The callback must return a replacement
unicode object that will be encoded instead of the
original character.

For example, replacing unencodable characters with XML
character references can be done in the following way:

u"aäoöuüß".encode(
   "ascii",
   lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos])
)
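
(For illustration: under the patch the call above would
produce something like 'a&#xe4;o&#xf6;u&#xfc;&#xdf;'; this
output is reconstructed, not captured from the patched
interpreter.)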




----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2002-03-07 02:29

Message:
Logged In: YES 
user_id=89016

I started from scratch, and the current state is this:

Encoding mostly works (except that I haven't changed
TranslateCharmap and EncodeDecimal yet), and most of the
decoding functionality works too (DecodeASCII and
DecodeCharmap are still unchanged), but the decoding
callback helper isn't optimized for the "builtin" names
yet (i.e. it still calls the handler).

For encoding, the callback helper knows how to
handle "strict", "replace", "ignore"
and "xmlcharrefreplace" itself and won't call the callback
for those names. This should make the encoder fast enough.
As the results of the callback-name string comparisons are
cached, it might even be faster than the original.

So far the patch hasn't required any changes to
unicodeobject.h, stringobject.h or stringobject.c.
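
As an illustration, the dispatch described above works
roughly like this Python model (all names here are invented
for the sketch; the patch implements this in C):

# Rough Python model of the encoder's handler dispatch.
BUILTIN_HANDLERS = ("strict", "replace", "ignore",
                    "xmlcharrefreplace")

def get_handler(errors, registry, cache={}):
    if errors in BUILTIN_HANDLERS:
        return errors            # handled inline by the encoder
    if errors not in cache:      # cache the registry lookup
        cache[errors] = registry[errors]
    return cache[errors]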


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-03-05 17:49

Message:
Logged In: YES 
user_id=38388

Walter, are you making any progress on the new scheme
we discussed on the mailing list (adding an error handler
registry much like the codec registry itself instead of trying 
to redo the complete codec API) ?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-09-20 12:38

Message:
Logged In: YES 
user_id=38388

I am postponing this patch until the PEP process has started. This feature won't make it into Python 2.2. 

Walter, you may want to reference this patch in the PEP.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-08-16 12:53

Message:
Logged In: YES 
user_id=38388

I think we ought to summarize these changes in a PEP to get some more feedback and testing from others as 
well.

I'll look into this after I'm back from vacation on September 10.

Given the release schedule I am not sure whether this feature will make it into 2.2. The size of the patch is huge 
and probably needs a lot of testing first.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-07-27 05:55

Message:
Logged In: YES 
user_id=89016

Changing the decoding API is done now. There
are two new functions,
codecs.register_unicodedecodeerrorhandler and
codecs.lookup_unicodedecodeerrorhandler.
Only the standard handlers for 'strict',
'ignore' and 'replace' are preregistered.

There may be many reasons for decoding errors
in the byte string, so I added an additional
argument to the decoding API: reason, which
gives the reason for the failure, e.g.:

>>> "\U1111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 
0x31 in position 8: truncated \UXXXXXXXX escape
>>> "\U11111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 
0x31 in position 9: illegal Unicode character

For symmetry I added this to the encoding API too:
>>> u"\xff".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in 
position 0: ordinal not in range(128)

The parameters passed to the callbacks now are:
encoding, unicode, position, reason, state.

The encoding and decoding API for strings has been 
adapted too, so now the new API should be usable 
everywhere:

>>> unicode("a\xffb\xffc", "ascii", 
...    lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
>>> "a\xffb\xffc".decode("ascii",
...    lambda enc, uni, pos, rea, sta: (u"<?>", 
pos+1))            
u'a<?>b<?>c'

I had a problem with the decoding API: all the
functions in _codecsmodule.c used the t# format
specifier. I changed that to O! with
&PyString_Type, because otherwise the decoding
API would have to pass buffer objects around
instead of strings, and the callback would have
to call str() on the buffer anyway to access a
specific character, so this wouldn't be any
faster than calling str() on the buffer before
decoding. It seems that buffers aren't used
anyway.

I changed all the old functions to call the new
ones so bugfixes don't have to be done in two
places. There are two exceptions: I didn't
change PyString_AsEncodedString and
PyString_AsDecodedString because they are
documented as deprecated anyway (although they
are still called in a few spots). This means that
I duplicated part of their functionality in
PyString_AsEncodedObjectEx and
PyString_AsDecodedObjectEx.

There are still a few spots that call the old API,
e.g. PyString_Format still calls PyUnicode_Decode
(but with strict decoding) because it passes the
rest of the format string to PyUnicode_Format
when it encounters a Unicode object.

Should we switch to the new API everywhere even 
if strict encoding/decoding is used?

The size of this patch begins to scare me. I
guess we need an extensive test script for all the
new features and documentation. I hope you have time
to do that, as I'll be busy with other projects in
the coming weeks. (BTW, I haven't touched
PyUnicode_TranslateCharmap yet.)


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-07-23 19:03

Message:
Logged In: YES 
user_id=89016

New version of the patch with the error handling callback 
registry. 

> > OK, done, now there's a
> > PyCodec_EscapeReplaceUnicodeEncodeErrors/
> > codecs.escapereplace_unicodeencode_errors
> > that uses \u (or \U if x>0xffff (with a wide build
> > of Python)).
> 
> Great!

Now PyCodec_EscapeReplaceUnicodeEncodeErrors uses \x
in addition to \u and \U where appropriate.
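
For illustration, a hypothetical session (assuming the
handler is registered under the name "escapereplace" as
described below; the output is reconstructed):

>>> u"a\xffb\u1234c".encode("ascii", "escapereplace")
'a\\xffb\\u1234c'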

> > [...]
> > But for special one-shot error handlers, it might still be
> > useful to pass the error handler directly, so maybe we
> > should leave error as PyObject *, but implement the
> > registry anyway?
> 
> Good idea !
> 
> One minor nit: codecs.registerError() should be named
> codecs.register_errorhandler() to be more inline with
> the Python coding style guide.

OK, but these functions are specific to unicode encoding,
so now the functions are called:
   codecs.register_unicodeencodeerrorhandler
   codecs.lookup_unicodeencodeerrorhandler

Now all callbacks (including the new
ones, "xmlcharrefreplace"
and "escapereplace") are registered in
_PyCodecRegistry_Init in codecs.c, so using them is really
simple: u"gürk".encode("ascii", "xmlcharrefreplace")


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-07-13 13:26

Message:
Logged In: YES 
user_id=38388

> > >    > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> > >    > > could be reimplemented as PyUnicode_EncodeASCII
> > >    > > with \uxxxx replacement callback.
> > >    >
> > >    > Hmm, wouldn't that result in a slowdown ? If so,
> > >    > I'd rather leave the special encoder in place,
> > >    > since it is being used a lot in Python and
> > >    > probably some applications too.
> > >
> > >    It would be a slowdown. But callbacks open many
> > >    possibilities.
> >
> > True, but in this case I believe that we should stick with
> > the native implementation for "unicode-escape". Having
> > a standard callback error handler which does the \uXXXX
> > replacement would be nice to have though, since this would
> > also be usable with lots of other codecs (e.g. all the
> > code page ones).
> 
> OK, done, now there's a
> PyCodec_EscapeReplaceUnicodeEncodeErrors/
> codecs.escapereplace_unicodeencode_errors
> that uses \u (or \U if x>0xffff (with a wide build
> of Python)).

Great !
 
> > [...]
> > >    Should the old TranslateCharmap map to the new
> > >    TranslateCharmapEx and inherit the
> > >    "multicharacter replacement" feature,
> > >    or should I leave it as it is?
> >
> > If possible, please also add the multichar replacement
> > to the old API. I think it is very useful and since the
> > old APIs work on raw buffers it would be a benefit to have
> > the functionality in the old implementation too.
> 
> OK! I will try to find the time to implement that in the
> next days.

Good.
 
> > [Decoding error callbacks]
> >
> > About the return value:
> >
> > I'd suggest to always use the same tuple interface, e.g.
> >
> >     callback(encoding, input_data, input_position, state) ->
> >         (output_to_be_appended, new_input_position)
> >
> > (I think it's better to use absolute values for the
> > position rather than offsets.)
> >
> > Perhaps the encoding callbacks should use the same
> > interface... what do you think ?
> 
> This would make the callback feature hypergeneric and a
> little slower, because tuples have to be created, but it
> (almost) unifies the encoding and decoding API ("almost"
> because for the encoder output_to_be_appended will be
> reencoded, while for the decoder it will simply be
> appended), so I'm for it.

That's the point. 

Note that I don't think the tuple creation
will hurt much (see the make_tuple() API in codecs.c)
since small tuples are cached by Python internally.
 
> I implemented this and changed the encoders to only
> look up the error handler on the first error. The UCS1
> encoder no longer uses the two-item stack strategy.
> (This strategy only makes sense for those encoders where
> the encoding itself is much more complicated than the
> looping/callback logic.) Memory overflow tests are now
> only done when an encoding error occurs, so the
> UCS1 encoder should be as fast as it was without
> error callbacks.
> 
> Do we want to enforce new_input_position>input_position,
> or should jumping back be allowed?

No; moving backwards should be allowed (this may be useful
in order to resynchronize with the input data).
 
> Here is the current todo list:
> 1. implement a new TranslateCharmap and fix the old.
> 2. New encoding API for string objects too.
> 3. Decoding
> 4. Documentation
> 5. Test cases
> 
> I'm thinking about a different strategy for implementing
> callbacks
> (see
> http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)
> 
> We could have an error handler registry, which maps names
> to error handlers; then it would be possible to keep the
> errors argument as "const char *" instead of "PyObject *".
> Currently PyCodec_UnicodeEncodeHandlerForObject is a
> backwards compatibility hack that will never go away,
> because
> it's always more convenient to type
>    u"...".encode("...", "strict")
> instead of
>    import codecs
>    u"...".encode("...", codecs.raise_encode_errors)
> 
> But with an error handler registry this function would
> become the official lookup method for error handlers.
> (PyCodec_LookupUnicodeEncodeErrorHandler?)
> Python code would look like this:
> ---
> def xmlreplace(encoding, uni, pos, state):
>    return (u"&#%d;" % ord(uni[pos]), pos+1)
> 
> import codecs
> 
> codecs.registerError("xmlreplace", xmlreplace)
> ---
> and then the following call can be made:
>         u"äöü".encode("ascii", "xmlreplace")
> As soon as the first error is encountered, the encoder uses
> its builtin error handling method if it recognizes the name
> ("strict", "replace" or "ignore") or looks up the error
> handling function in the registry if it doesn't. In this way
> the speed for the backwards compatible features is the same
> as before and "const char *error" can be kept as the
> parameter to all encoding functions. For speed common error
> handling names could even be implemented in the encoder
> itself.
> 
> But for special one-shot error handlers, it might still be
> useful to pass the error handler directly, so maybe we
> should leave error as PyObject *, but implement the
> registry anyway?

Good idea !

One minor nit: codecs.registerError() should be named
codecs.register_errorhandler() to be more inline with
the Python coding style guide.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-07-12 13:03

Message:
Logged In: YES 
user_id=89016

> >    [...]
> >    so I guess we could change the replace handler
> >    to always return u'?'. This would make the
> >    implementation a little bit simpler, but the 
> >    explanation of the callback feature *a lot* 
> >    simpler. 
> 
> Go for it.

OK, done!

> [...]
> >    > Could you add these docs to the Misc/unicode.txt
> >    > file ? I will eventually take that file and turn 
> >    > it into a PEP which will then serve as general 
> >    > documentation for these things.
> > 
> >    I could, but first we should work out how the 
> >    decoding callback API will work.
> 
> Ok. BTW, Barry Warsaw already did the work of converting
> the unicode.txt to PEP 100, so the docs should eventually 
> go there.

OK. I guess it would be best to do this when everything 
is finished.

> >    > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> >    > > could be reimplemented as PyUnicode_EncodeASCII 
> >    > > with \uxxxx replacement callback.
> >    >
> >    > Hmm, wouldn't that result in a slowdown ? If so,
> >    > I'd rather leave the special encoder in place, 
> >    > since it is being used a lot in Python and 
> >    > probably some applications too.
> > 
> >    It would be a slowdown. But callbacks open many
> >    possibilities.
> 
> True, but in this case I believe that we should stick with
> the native implementation for "unicode-escape". Having
> a standard callback error handler which does the \uXXXX
> replacement would be nice to have though, since this would
> also be usable with lots of other codecs (e.g. all the
> code page ones).

OK, done, now there's a 
PyCodec_EscapeReplaceUnicodeEncodeErrors/
codecs.escapereplace_unicodeencode_errors
that uses \u (or \U if x>0xffff (with a wide build
of Python)).

> >    For example:
> > 
> >       Why can't I print u"gürk"?
> > 
> >    is probably one of the most frequently asked
> >    questions in comp.lang.python. For printing
> >    Unicode stuff, print could be extended to use an
> >    error handling callback for Unicode strings (or
> >    objects where __str__ or tp_str returns a Unicode
> >    object) instead of using str(), which always
> >    returns an 8bit string and uses strict encoding.
> >    There might even be a
> >    sys.setprintencodehandler()/sys.getprintencodehandler()
> 
> There already is a print callback in Python (forgot the
> name of the hook though), so this should be possible by 
> providing the encoding logic in the hook.

True: sys.displayhook
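
A sketch of what such a hook could look like (illustrative
only; "replace" stands in for a real error callback here):

import sys

def unicode_displayhook(value):
    # Print unicode results with a lenient encoding instead
    # of the default strict str() conversion.
    if isinstance(value, unicode):
        print value.encode("ascii", "replace")
    else:
        sys.__displayhook__(value)

sys.displayhook = unicode_displayhook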

> [...]
> >    Should the old TranslateCharmap map to the new 
> >    TranslateCharmapEx and inherit the 
> >    "multicharacter replacement" feature,
> >    or should I leave it as it is?
> 
> If possible, please also add the multichar replacement
> to the old API. I think it is very useful and since the
> old APIs work on raw buffers it would be a benefit to have
> the functionality in the old implementation too.

OK! I will try to find the time to implement that in the 
next days.

> [Decoding error callbacks]
>
> About the return value:
> 
> I'd suggest to always use the same tuple interface, e.g.
> 
> >     callback(encoding, input_data, input_position, state) ->
> >         (output_to_be_appended, new_input_position)
> 
> (I think it's better to use absolute values for the 
> position rather than offsets.)
> 
> Perhaps the encoding callbacks should use the same 
> interface... what do you think ?

This would make the callback feature hypergeneric and a
little slower, because tuples have to be created, but it
(almost) unifies the encoding and decoding API ("almost"
because for the encoder output_to_be_appended will be
reencoded, while for the decoder it will simply be
appended), so I'm for it.

I implemented this and changed the encoders to only
look up the error handler on the first error. The UCS1
encoder no longer uses the two-item stack strategy.
(This strategy only makes sense for those encoders where
the encoding itself is much more complicated than the
looping/callback logic.) Memory overflow tests are now
only done when an encoding error occurs, so the
UCS1 encoder should be as fast as it was without
error callbacks.

Do we want to enforce new_input_position>input_position,
or should jumping back be allowed?

> >    > > One additional note: It is vital that errors
> >    > > is an assignable attribute of the StreamWriter.
> >    >
> >    > It is already !
> > 
> >    I know, but IMHO it should be documented that an
> >    assignable errors attribute must be supported 
> >    as part of the official codec API.
> > 
> >    Misc/unicode.txt is not clear on that:
> >    """
> >    It is not required by the Unicode implementation
> >    to use these base classes, only the interfaces must 
> >    match; this allows writing Codecs as extension types.
> >    """
> 
> Good point. I'll add that to PEP 100.

OK.

Here is the current todo list:
1. implement a new TranslateCharmap and fix the old.
2. New encoding API for string objects too.
3. Decoding
4. Documentation
5. Test cases

I'm thinking about a different strategy for implementing 
callbacks
(see
http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)

We could have an error handler registry, which maps names
to error handlers; then it would be possible to keep the
errors argument as "const char *" instead of "PyObject *".
Currently PyCodec_UnicodeEncodeHandlerForObject is a
backwards compatibility hack that will never go away,
because it's always more convenient to type
   u"...".encode("...", "strict")
instead of
   import codecs
   u"...".encode("...", codecs.raise_encode_errors)

But with an error handler registry this function would 
become the official lookup method for error handlers. 
(PyCodec_LookupUnicodeEncodeErrorHandler?)
Python code would look like this:
---
def xmlreplace(encoding, uni, pos, state):
   return (u"&#%d;" % ord(uni[pos]), pos+1)

import codecs

codecs.registerError("xmlreplace", xmlreplace)
---
and then the following call can be made:
	u"äöü".encode("ascii", "xmlreplace")
As soon as the first error is encountered, the encoder uses
its builtin error handling method if it recognizes the name 
("strict", "replace" or "ignore") or looks up the error 
handling function in the registry if it doesn't. In this way
the speed for the backwards compatible features is the same 
as before and "const char *error" can be kept as the 
parameter to all encoding functions. For speed common error 
handling names could even be implemented in the encoder 
itself.

But for special one-shot error handlers, it might still be 
useful to pass the error handler directly, so maybe we 
should leave error as PyObject *, but implement the 
registry anyway?


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-07-10 14:29

Message:
Logged In: YES 
user_id=38388

Ok, here we go...

>    > > raise an exception). U+FFFD characters in the replacement
>    > > string will be replaced with a character that the encoder
>    > > chooses ('?' in all cases).
>    >
>    > Nice.
> 
>    But the special casing of U+FFFD makes the interface somewhat
>    less clean than it could be. It was only done to be 100%
>    backwards compatible. With the original "replace" error
>    handling the codec chose the replacement character. But as
>    far as I can tell none of the codecs uses anything other
>    than '?',

True.

>    so I guess we could change the replace handler
>    to always return u'?'. This would make the implementation a
>    little bit simpler, but the explanation of the callback
>    feature *a lot* simpler. 

Go for it.

>    And if you still want to handle
>    an unencodable U+FFFD, you can write a special callback for
>    that, e.g.
> 
>    def FFFDreplace(enc, uni, pos):
>        if uni[pos] == "\ufffd":
>            return u"?"
>        else:
>            raise UnicodeError(...)
>
>    > ...docs...
>    >
>    > Could you add these docs to the Misc/unicode.txt file ? I
>    > will eventually take that file and turn it into a PEP which
>    > will then serve as general documentation for these things.
> 
>    I could, but first we should work out how the decoding
>    callback API will work.

Ok. BTW, Barry Warsaw already did the work of converting the
unicode.txt to PEP 100, so the docs should eventually go there.
 
>    > > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
>    > > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
>    > > replacement callback.
>    >
>    > Hmm, wouldn't that result in a slowdown ? If so, I'd rather
>    > leave the special encoder in place, since it is being used a
>    > lot in Python and probably some applications too.
> 
>    It would be a slowdown. But callbacks open many
>    possibilities.

True, but in this case I believe that we should stick with
the native implementation for "unicode-escape". Having
a standard callback error handler which does the \uXXXX
replacement would be nice to have though, since this would
also be usable with lots of other codecs (e.g. all the code page
ones).
 
>    For example:
> 
>       Why can't I print u"gürk"?
> 
>    is probably one of the most frequently asked questions in
>    comp.lang.python. For printing Unicode stuff, print could be
>    extended to use an error handling callback for Unicode
>    strings (or objects where __str__ or tp_str returns a 
>    Unicode object) instead of using str() which always returns 
>    an 8bit string and uses strict encoding. There might even 
>    be a
>    sys.setprintencodehandler()/sys.getprintencodehandler()

There already is a print callback in Python (forgot the name of the
hook though), so this should be possible by providing the
encoding logic in the hook.
 
>    > > I have not touched PyUnicode_TranslateCharmap yet,
>    > > should this function also support error callbacks? Why
>    > > would one want to insert None into the mapping to call
>    > > the callback?
>    >
>    > 1. Yes.
>    > 2. The user may want to e.g. restrict usage of certain
>    > character ranges. In this case the codec would be used to
>    > verify the input and an exception would indeed be useful
>    > (e.g. say you want to restrict input to Hangul + ASCII).
> 
>    OK, do we want TranslateCharmap to work exactly like 
>    encoding,
>    i.e. in case of an error should the returned replacement
>    string again be mapped through the translation mapping or
>    should it be copied to the output directly? The former would
>    be more in line with encoding, but IMHO the latter would
>    be much more useful.

It's better to take the second approach (copy the callback
output directly to the output string) to avoid endless
recursion and other pitfalls.

I suppose this will also simplify the implementation somewhat.
 
>    BTW, when I implement it I can implement patch #403100
>    ("Multicharacter replacements in PyUnicode_TranslateCharmap")
>    along the way.

I've seen it; will comment on it later.
 
>    Should the old TranslateCharmap map to the new 
>    TranslateCharmapEx
>    and inherit the "multicharacter replacement" feature,
>    or
>    should I leave it as it is?

If possible, please also add the multichar replacement
to the old API. I think it is very useful and since the
old APIs work on raw buffers it would be a benefit to have
the functionality in the old implementation too.
 
[Decoding error callbacks]

>    > > A remaining problem is how to implement decoding error
>    > > callbacks. In Python 2.1 encoding and decoding errors are
>    > > handled in the same way with a string value. But with
>    > > callbacks it doesn't make sense to use the same callback
>    > > for encoding and decoding (like codecs.StreamReaderWriter
>    > > and codecs.StreamRecoder do). Decoding callbacks have a
>    > > different API. Which arguments should be passed to the
>    > > decoding callback, and what is the decoding callback
>    > > supposed to do?
>    >
>    > I'd suggest adding another set of PyCodec_UnicodeDecode...()
>    > APIs for this. We'd then have to augment the base classes of
>    > the StreamCodecs to provide two attributes for .errors with
>    > a fallback solution for the string case (i.e. "strict" can
>    > still be used for both directions).
> 
>    Sounds good. Now what is the decoding callback supposed to 
>    do?
>    I guess it will be called in the same way as the encoding
>    callback, i.e. with encoding name, original string and
>    position of the error. It might return a Unicode string
>    (i.e. an object of the decoding target type), that will be
>    emitted from the codec instead of the one offending byte. Or
>    it might return a tuple with replacement Unicode object and
>    a resynchronisation offset, i.e. returning (u"?", 1)
>    means
>    emit a '?' and skip the offending character. But to make
>    the offset really useful the callback has to know something
>    about the encoding, perhaps the codec should be allowed to
>    pass an additional state object to the callback?
> 
>    Maybe the same should be added to the encoding callbacks too?
>    Maybe the encoding callback should be able to tell the
>    encoder if the replacement returned should be reencoded
>    (in which case it's a Unicode object), or directly emitted
>    (in which case it's an 8bit string)?

I like the idea of having an optional state object (basically
this should be a codec-defined arbitrary Python object)
which then allow the callback to apply additional tricks.
The object should be documented to be modifiable in place
(simplifies the interface).

About the return value:

I'd suggest to always use the same tuple interface, e.g.

    callback(encoding, input_data, input_position, state) -> 
        (output_to_be_appended, new_input_position)

(I think it's better to use absolute values for the position 
rather than offsets.)

Perhaps the encoding callbacks should use the same 
interface... what do you think ?
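
A decoding callback matching this proposed interface could
look like the following sketch (names invented):

def skipreplace(encoding, input_data, input_position, state):
    # Emit u"?" for the offending byte and resume decoding
    # at the next (absolute) position.
    return (u"?", input_position + 1)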

>    > > One additional note: It is vital that errors is an
>    > > assignable attribute of the StreamWriter.
>    >
>    > It is already !
> 
>    I know, but IMHO it should be documented that an assignable
>    errors attribute must be supported as part of the official
>    codec API.
> 
>    Misc/unicode.txt is not clear on that:
>    """
>    It is not required by the Unicode implementation to use 
>    these base classes, only the interfaces must match; this 
>    allows writing Codecs as extension types.
>    """

Good point. I'll add that to PEP 100.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-22 22:51

Message:
Logged In: YES 
user_id=38388

Sorry to keep you waiting, Walter. I will look into this
again next week -- this week was way too busy...

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-13 19:00

Message:
Logged In: YES 
user_id=38388

On your comment about the non-Unicode codecs: let's keep
this separated from the current patch.

Don't have much time today. I'll comment on the other things
tomorrow.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-13 17:49

Message:
Logged In: YES 
user_id=89016

Guido van Rossum wrote in python-dev:

> True, the "codec" pattern can be used for other 
> encodings than Unicode.  But it seems to me that the
> entire codecs architecture is rather strongly geared
> towards en/decoding Unicode, and it's not clear
> how well other codecs fit in this pattern (e.g. I 
> noticed that all the non-Unicode codecs ignore the 
> error handling parameter or assert that
> it is set to 'strict').

I noticed that too. Asserting that errors=='strict' would
mean that the encoder cannot deal with unencodable input
in any way other than raising an error. But that
is not the problem here, because for the zlib, base64, quopri,
hex and uu encodings there can be no unencodable characters.
The encoders can simply ignore the errors parameter. Should
I remove the asserts from those codecs and change the
docstrings accordingly, or will this be done separately?


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-13 15:57

Message:
Logged In: YES 
user_id=89016

> > [...]
> > raise an exception). U+FFFD characters in the replacement
> > string will be replaced with a character that the encoder
> > chooses ('?' in all cases).
>
> Nice.

But the special casing of U+FFFD makes the interface 
somewhat
less clean than it could be. It was only done to be 100%
backwards compatible. With the original "replace" error
handling the codec chose the replacement character. But as
far as I can tell none of the codecs uses anything other
than '?', so I guess we could change the replace handler
to always return u'?'. This would make the implementation a
little bit simpler, but the explanation of the callback
feature *a lot* simpler. And if you still want to handle
an unencodable U+FFFD, you can write a special callback for
that, e.g.

def FFFDreplace(enc, uni, pos):
    if uni[pos] == "\ufffd":
        return u"?"
    else:
        raise UnicodeError(...)

> > The implementation of the loop through the string is done
> > in the following way. A stack with two strings is kept
> > and the loop always encodes a character from the string
> > at the stacktop. If an error is encountered and the stack
> > has only one entry (during encoding of the original string)
> > the callback is called and the unicode object returned is
> > pushed on the stack, so the encoding continues with the
> > replacement string. If the stack has two entries when an
> > error is encountered, the replacement string itself has
> > an unencodable character and a normal exception is raised.
> > When the encoder has reached the end of its current string
> > there are two possibilities: when the stack contains two
> > entries, this was the replacement string, so the replacement
> > string will be popped from the stack and encoding continues
> > with the next character from the original string. If the
> > stack had only one entry, encoding is finished.
>
> Very elegant solution !

I'll put it as a comment in the source.

> > (I hope that's enough explanation of the API and
> > implementation)
>
> Could you add these docs to the Misc/unicode.txt file ? I
> will eventually take that file and turn it into a PEP which
> will then serve as general documentation for these things.

I could, but first we should work out how the decoding
callback API will work.

> > I have renamed the static ...121 function to all
> > lowercase names.
>
> Ok.
>
> > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> > replacement callback.
>
> Hmm, wouldn't that result in a slowdown ? If so, I'd rather
> leave the special encoder in place, since it is being used a
> lot in Python and probably some applications too.

It would be a slowdown. But callbacks open many
possibilities.

For example:

   Why can't I print u"gürk"?

is probably one of the most frequently asked questions in
comp.lang.python. For printing Unicode stuff, print could be
extended to use an error handling callback for Unicode
strings (or objects where __str__ or tp_str returns a 
Unicode object) instead of using str() which always returns 
an 8bit string and uses strict encoding. There might even 
be a
sys.setprintencodehandler()/sys.getprintencodehandler()

> [...]
> I think it would be worthwhile to rename the callbacks to
> include "Unicode" somewhere, e.g.
> PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
> then it points out the application field of the callback
> rather well. Same for the callbacks exposed through the
> _codecsmodule.

OK, done (and PyCodec_XMLCharRefReplaceUnicodeEncodeErrors
really is a long name ;))

> > I have not touched PyUnicode_TranslateCharmap yet,
> > should this function also support error callbacks? Why
> > would one want to insert None into the mapping to call
> > the callback?
>
> 1. Yes.
> 2. The user may want to e.g. restrict usage of certain
> character ranges. In this case the codec would be used to
> verify the input and an exception would indeed be useful
> (e.g. say you want to restrict input to Hangul + ASCII).

OK, do we want TranslateCharmap to work exactly like 
encoding,
i.e. in case of an error should the returned replacement
string again be mapped through the translation mapping or
should it be copied to the output directly? The former would
be more in line with encoding, but IMHO the latter would
be much more useful.

BTW, when I implement it I can implement patch #403100
("Multicharacter replacements in PyUnicode_TranslateCharmap")
along the way.

Should the old TranslateCharmap map to the new TranslateCharmapEx
and inherit the "multicharacter replacement" feature, or
should I leave it as it is?

> > A remaining problem is how to implement decoding error
> > callbacks. In Python 2.1 encoding and decoding errors are
> > handled in the same way with a string value. But with
> > callbacks it doesn't make sense to use the same callback
> > for encoding and decoding (like codecs.StreamReaderWriter
> > and codecs.StreamRecoder do). Decoding callbacks have a
> > different API. Which arguments should be passed to the
> > decoding callback, and what is the decoding callback
> > supposed to do?
>
> I'd suggest adding another set of PyCodec_UnicodeDecode...()
> APIs for this. We'd then have to augment the base classes of
> the StreamCodecs to provide two attributes for .errors with
> a fallback solution for the string case (i.e. "strict" can
> still be used for both directions).

Sounds good. Now what is the decoding callback supposed to 
do?
I guess it will be called in the same way as the encoding
callback, i.e. with encoding name, original string and
position of the error. It might return a Unicode string
(i.e. an object of the decoding target type), that will be
emitted from the codec instead of the one offending byte. Or
it might return a tuple with replacement Unicode object and
a resynchronisation offset, i.e. returning (u"?", 1) means
emit a '?' and skip the offending character. But to make
the offset really useful the callback has to know something
about the encoding, perhaps the codec should be allowed to
pass an additional state object to the callback?

Maybe the same should be added to the encoding callbacks too?
Maybe the encoding callback should be able to tell the
encoder if the replacement returned should be reencoded
(in which case it's a Unicode object), or directly emitted
(in which case it's an 8bit string)?

> > One additional note: It is vital that errors is an
> > assignable attribute of the StreamWriter.
>
> It is already !

I know, but IMHO it should be documented that an assignable
errors attribute must be supported as part of the official
codec API.

Misc/unicode.txt is not clear on that:
"""
It is not required by the Unicode implementation to use 
these base classes, only the interfaces must match; this 
allows writing Codecs as extension types.
"""

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-13 10:05

Message:
Logged In: YES 
user_id=38388

> How the callbacks work:
> 
> A PyObject * named errors is passed in. This may be NULL,
> Py_None, 'strict', u'strict', 'ignore', u'ignore',
> 'replace', u'replace' or a callable object.
> PyCodec_EncodeHandlerForObject maps all of these objects to
> one of the three builtin error callbacks
> PyCodec_RaiseEncodeErrors (raises an exception),
> PyCodec_IgnoreEncodeErrors (returns an empty replacement
> string, in effect ignoring the error),
> PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
> replacement character to signify to the encoder that it
> should choose a suitable replacement character) or directly
> returns errors if it is a callable object. When an
> unencodable character is encountered the error handling
> callback will be called with the encoding name, the original
> unicode object and the error position and must return a
> unicode object that will be encoded instead of the offending
> character (or the callback may of course raise an
> exception). U+FFFD characters in the replacement string will
> be replaced with a character that the encoder chooses ('?'
> in all cases).

Nice.
 
> The implementation of the loop through the string is done in
> the following way. A stack with two strings is kept and the
> loop always encodes a character from the string at the
> stacktop. If an error is encountered and the stack has only
> one entry (during encoding of the original string) the
> callback is called and the unicode object returned is pushed
> on the stack, so the encoding continues with the replacement
> string. If the stack has two entries when an error is
> encountered, the replacement string itself has an
> unencodable character and a normal exception is raised. When
> the encoder has reached the end of its current string there
> are two possibilities: when the stack contains two entries,
> this was the replacement string, so the replacement string
> will be popped from the stack and encoding continues with
> the next character from the original string. If the stack
> had only one entry, encoding is finished.

Very elegant solution !
 
> (I hope that's enough explanation of the API and
> implementation)

Could you add these docs to the Misc/unicode.txt file ? I
will eventually take that file and turn it into a PEP which
will then serve as general documentation for these things.
 
> I have renamed the static ...121 function to all lowercase
> names.

Ok.
 
> BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> replacement callback.

Hmm, wouldn't that result in a slowdown ? If so, I'd rather
leave the special encoder in place, since it is being used a
lot in Python and probably some applications too.
 
> PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
> PyCodec_ReplaceEncodeErrors are globally visible because
> they have to be available in _codecsmodule.c to wrap them as
> Python function objects, but they can't be implemented in
> _codecsmodule, because they need to be available to the
> encoders in unicodeobject.c (through
> PyCodec_EncodeHandlerForObject), but importing the codecs
> module might result in an endless recursion, because
> importing a module requires unpickling of the bytecode,
> which might require decoding utf8, which ... (but this will
> only happen, if we implement the same mechanism for the
> decoding API)

I think that codecs.c is the right place for these APIs.
_codecsmodule.c is only meant as a Python access wrapper for
the internal codecs and nothing more. 

One thing I noted about the callbacks: they assume that they
will always get Unicode objects as input. This is certainly
not true in the general case (it is for the codecs you touch
in the patch). 

I think it would be worthwhile to rename the callbacks to
include "Unicode" somewhere, e.g.
PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
then it points out the application field of the callback
rather well. Same for the callbacks exposed through the
_codecsmodule.

> I have not touched PyUnicode_TranslateCharmap yet,
> should this function also support error callbacks? Why would
> one want to insert None into the mapping to call the
> callback?

1. Yes.
2. The user may want to e.g. restrict usage of certain
character ranges. In this case the codec would be used to
verify the input and an exception would indeed be useful
(e.g. say you want to restrict input to Hangul + ASCII).
 
> A remaining problem is how to implement decoding error
> callbacks. In Python 2.1 encoding and decoding errors are
> handled in the same way with a string value. But with
> callbacks it doesn't make sense to use the same callback for
> encoding and decoding (like codecs.StreamReaderWriter and
> codecs.StreamRecoder do). Decoding callbacks have a
> different API. Which arguments should be passed to the
> decoding callback, and what is the decoding callback
> supposed to do?

I'd suggest adding another set of PyCodec_UnicodeDecode...()
APIs for this. We'd then have to augment the base classes of
the StreamCodecs to provide two attributes for .errors with
a fallback solution for the string case (i.s. "strict" can
still be used for both directions).

> One additional note: It is vital that errors is an
> assignable attribute of the StreamWriter.

It is already !
 
> Consider the XML example: For writing an XML DOM tree one
> StreamWriter object is used. When a text node is written,
> the error handling has to be set to
> codecs.xmlreplace_encode_errors, but inside a comment or
> processing instruction replacing unencodable characters with
> charrefs is not possible, so here codecs.raise_encode_errors
> should be used (or better a custom error handler that raises
> an error that says "sorry, you can't have unencodable
> characters inside a comment")

Sure.
 
> BTW, should we continue the discussion in the i18n SIG
> mailing list? An email program is much more comfortable than
> an HTML textarea! ;)

I'd rather keep the discussions on this patch here --
forking it off to the i18n sig will make it very hard to
follow up on it. (This HTML area is indeed damn small ;-)
 


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 21:18

Message:
Logged In: YES 
user_id=89016

One additional note: It is vital that errors is an
assignable attribute of the StreamWriter. 

Consider the XML example: For writing an XML DOM tree one
StreamWriter object is used. When a text node is written,
the error handling has to be set to
codecs.xmlreplace_encode_errors, but inside a comment or
processing instruction replacing unencodable characters with
charrefs is not possible, so here codecs.raise_encode_errors
should be used (or better a custom error handler that raises
an error that says "sorry, you can't have unencodable
characters inside a comment")

BTW, should we continue the discussion in the i18n SIG
mailing list? An email program is much more comfortable than
an HTML textarea! ;)



----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 20:59

Message:
Logged In: YES 
user_id=89016

How the callbacks work:

A PyObject * named errors is passed in. This may be NULL,
Py_None, 'strict', u'strict', 'ignore', u'ignore',
'replace', u'replace' or a callable object.
PyCodec_EncodeHandlerForObject maps all of these objects to
one of the three builtin error callbacks
PyCodec_RaiseEncodeErrors (raises an exception),
PyCodec_IgnoreEncodeErrors (returns an empty replacement
string, in effect ignoring the error),
PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
replacement character to signify to the encoder that it
should choose a suitable replacement character) or directly
returns errors if it is a callable object. When an
unencodable character is encountered the error handling
callback will be called with the encoding name, the original
unicode object and the error position and must return a
unicode object that will be encoded instead of the offending
character (or the callback may of course raise an
exception). U+FFFD characters in the replacement string will 
be replaced with a character that the encoder chooses ('?'
in all cases).
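
In Python terms the mapping works roughly like this
(illustrative sketch; the real function is C code, and the
PyCodec_* names stand in here for the builtin callbacks):

def encode_handler_for_object(errors):
    # Rough model of PyCodec_EncodeHandlerForObject.
    if errors is None or errors in ("strict", u"strict"):
        return PyCodec_RaiseEncodeErrors
    if errors in ("ignore", u"ignore"):
        return PyCodec_IgnoreEncodeErrors
    if errors in ("replace", u"replace"):
        return PyCodec_ReplaceEncodeErrors
    if callable(errors):
        return errors
    raise TypeError("invalid errors argument")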

The implementation of the loop through the string is done in
the following way. A stack with two strings is kept and the
loop always encodes a character from the string at the
stacktop. If an error is encountered and the stack has only
one entry (during encoding of the original string) the
callback is called and the unicode object returned is pushed
on the stack, so the encoding continues with the replacement
string. If the stack has two entries when an error is
encountered, the replacement string itself has an
unencodable character and a normal exception is raised. When
the encoder has reached the end of its current string there
are two possibilities: when the stack contains two entries,
this was the replacement string, so the replacement string
will be popped from the stack and encoding continues with
the next character from the original string. If the stack
had only one entry, encoding is finished.
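
Here is a rough Python model of that loop (purely
illustrative; encode_char and handler are invented names,
and the real implementation is C code in unicodeobject.c):

def encode_with_stack(s, encode_char, handler):
    # encode_char raises UnicodeError for a character it
    # cannot encode; handler returns a replacement string.
    result = []
    stack = [(s, 0)]
    while stack:
        string, pos = stack.pop()
        while pos < len(string):
            try:
                result.append(encode_char(string[pos]))
            except UnicodeError:
                if stack:
                    # error inside the replacement string
                    raise
                replacement = handler(string, pos)
                # remember where to resume in the original
                stack.append((string, pos + 1))
                string, pos = replacement, 0
                continue
            pos += 1
    return "".join(result)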

(I hope that's enough explanation of the API and implementation)

I have renamed the static ...121 function to all lowercase
names.

BTW, I guess PyUnicode_EncodeUnicodeEscape could be
reimplemented as PyUnicode_EncodeASCII with a \uxxxx
replacement callback.

PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
PyCodec_ReplaceEncodeErrors are globally visible because
they have to be available in _codecsmodule.c to wrap them as
Python function objects, but they can't be implemented in
_codecsmodule, because they need to be available to the
encoders in unicodeobject.c (through
PyCodec_EncodeHandlerForObject), but importing the codecs
module might result in an endless recursion, because
importing a module requires unpickling of the bytecode,
which might require decoding utf8, which ... (but this will
only happen, if we implement the same mechanism for the
decoding API)

I have not touched PyUnicode_TranslateCharmap yet, 
should this function also support error callbacks? Why would
one want to insert None into the mapping to call the callback?

A remaining problem is how to implement decoding error
callbacks. In Python 2.1 encoding and decoding errors are
handled in the same way with a string value. But with
callbacks it doesn't make sense to use the same callback for
encoding and decoding (like codecs.StreamReaderWriter and
codecs.StreamRecoder do). Decoding callbacks have a
different API. Which arguments should be passed to the
decoding callback, and what is the decoding callback
supposed to do?


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-12 20:00

Message:
Logged In: YES 
user_id=38388

About the Py_UNICODE*data, int size APIs:
Ok, point taken.

In general, I think we ought to keep the callback feature as
open as possible, so passing in pointers and sizes would not
be very useful.

BTW, could you summarize how the callback works in a few
lines ?

About _Encode121: I'd name this _EncodeUCS1 since that's
what it is ;-)

About the new functions: I was referring to the new static
functions which you gave PyUnicode_... names. If these are
not supposed to turn into non-static functions, I'd rather
have them use lower case names (since that's how the Python
internals work too -- most of the times).



----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 18:56

Message:
Logged In: YES 
user_id=89016

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE*data, int size style arguments
> --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as Unicode object.

Another problem is that the callback requires a Python
object, so in the PyObject * version the refcount is
simply incref'd and the object is passed to the callback. A
Py_UNICODE */int version would have to create a new Unicode
object from the data first.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-12 18:32

Message:
Logged In: YES 
user_id=89016

> * please don't place more than one C statement on one line
> like in:
> """
> +               unicode = unicode2; unicodepos =
> unicode2pos;
> +               unicode2 = NULL; unicode2pos = 0;
> """

OK, done!

> * Comments should start with a capital letter and be
> prepended
> to the section they apply to

Fixed!

> * There should be spaces between arguments in compares
> (a == b) not (a==b)

Fixed!

> * Where does the name "...Encode121" originate ?

It means "encode one-to-one"; the function implements both
ASCII and Latin-1 encoding.

> * module internal APIs should use lower case names (you
> converted some of these to  PyUnicode_...() -- this is
> normally reserved for APIs which are either marked as
> potential candidates for the public API or are very
> prominent in the code)

Which ones? I introduced a new function for every old one
that had a "const char *errors" argument, and a few new ones
in codecs.h. Of those, PyCodec_EncodeHandlerForObject is
vital, because it is used to map old string arguments to
the new function objects. PyCodec_RaiseEncodeErrors can be
used in the encoder implementation to raise an encode error,
but it could be made static in unicodeobject.c so only the
encoders implemented there have access to it.

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE*data, int size style arguments --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as Unicode object.

I looked through the code and found no situation where the
Py_UNICODE */int version is really used, and having two
(PyObject *)s (the original and the replacement string)
instead of Py_UNICODE */int and PyObject * made the
implementation a little easier, but I can fix that.

> Please separate the errors.c patch from this patch -- it
> seems totally unrelated to Unicode.

PyCodec_RaiseEncodeErrors used it to produce a \uxxxx escape
with four hex digits. I removed it.

I'll upload a revised patch as soon as it's done.



----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-12 16:29

Message:
Logged In: YES 
user_id=38388

Thanks for the patch -- it looks very impressive!

I'll give it a try later this week. 

Some first cosmetic tidbits:
* please don't place more than one C statement on one line
like in:
"""
+               unicode = unicode2; unicodepos =
unicode2pos;
+               unicode2 = NULL; unicode2pos = 0;
"""

* Comments should start with a capital letter and be
prepended
to the section they apply to

* There should be spaces between arguments in compares
(a == b) not (a==b)

* Where does the name "...Encode121" originate ?

* module internal APIs should use lower case names (you
converted some of these to PyUnicode_...() -- this is
normally reserved for APIs which are either marked as
potential candidates for the public API or are very
prominent in the code)

One thing which I don't like about your API change is that
you removed the Py_UNICODE*data, int size style arguments --
this makes it impossible to use the new APIs on non-Python
data or data which is not available as Unicode object.

Please separate the errors.c patch from this patch -- it
seems totally unrelated to Unicode.

Thanks.


----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=432401&group_id=5470