PEP 293, Codec Error Handling Callbacks

Here's another new PEP.

Bye,
   Walter Dörwald

----------------------------------------------------------------------
PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision: 1.1 $
Last-Modified: $Date: 2002/06/19 03:22:11 $
Author: Walter Dörwald
Status: Draft
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History:


Abstract

    This PEP aims at extending Python's fixed codec error handling
    schemes with a more flexible callback based approach.

    Python currently uses a fixed error handling for codec error
    handlers.  This PEP describes a mechanism which allows Python to
    use function callbacks as error handlers.  With these more
    flexible error handlers it is possible to add new functionality to
    existing codecs by e.g. providing fallback solutions or different
    encodings for cases where the standard codec mapping does not
    apply.


Specification

    Currently the set of codec error handling algorithms is fixed to
    either "strict", "replace" or "ignore" and the semantics of these
    algorithms is implemented separately for each codec.

    The proposed patch will make the set of error handling algorithms
    extensible through a codec error handler registry which maps
    handler names to handler functions.  This registry consists of the
    following two C functions:

        int PyCodec_RegisterError(const char *name, PyObject *error)

        PyObject *PyCodec_LookupError(const char *name)

    and their Python counterparts

        codecs.register_error(name, error)

        codecs.lookup_error(name)

    PyCodec_LookupError raises a LookupError if no callback function
    has been registered under this name.

    Similar to the encoding name registry there is no way of
    unregistering callback functions or iterating through the
    available functions.

    The callback functions will be used in the following way by the
    codecs: when the codec encounters an encoding/decoding error, the
    callback function is looked up by name, the information about the
    error is stored in an exception object and the callback is called
    with this object.
    The callback returns information about how to proceed (or raises an
    exception).

    For encoding, the exception object will look like this:

        class UnicodeEncodeError(UnicodeError):
            def __init__(self, encoding, object, start, end, reason):
                # Parenthesized so that % formats the whole message.
                UnicodeError.__init__(self,
                    ("encoding '%s' can't encode characters " +
                     "in positions %d-%d: %s") % (encoding,
                        start, end-1, reason))
                self.encoding = encoding
                self.object = object
                self.start = start
                self.end = end
                self.reason = reason

    This type will be implemented in C with the appropriate setter and
    getter methods for the attributes, which have the following
    meaning:

      * encoding: The name of the encoding;
      * object: The original unicode object for which encode() has
        been called;
      * start: The position of the first unencodable character;
      * end: (The position of the last unencodable character)+1 (or
        the length of object, if all characters from start to the end
        of object are unencodable);
      * reason: The reason why object[start:end] couldn't be encoded.

    If object has consecutive unencodable characters, the encoder
    should collect those characters for one call to the callback if
    those characters can't be encoded for the same reason.  The
    encoder is not required to implement this behaviour and may call
    the callback for every single character, but it is strongly
    suggested that the collecting method is implemented.

    The callback must not modify the exception object.  If the
    callback does not raise an exception (either the one passed in, or
    a different one), it must return a tuple:

        (replacement, newpos)

    replacement is a unicode object that the encoder will encode and
    emit instead of the unencodable object[start:end] part, newpos
    specifies a new position within object, where (after encoding the
    replacement) the encoder will continue encoding.

    If the replacement string itself contains an unencodable character
    the encoder raises the exception object (but may set a different
    reason string before raising).
    Should further encoding errors occur, the encoder is allowed to
    reuse the exception object for the next call to the callback.
    Furthermore the encoder is allowed to cache the result of
    codecs.lookup_error.

    If the callback does not know how to handle the exception, it must
    raise a TypeError.

    Decoding works similarly to encoding with the following
    differences: The exception class is named UnicodeDecodeError and
    the attribute object is the original 8bit string that the decoder
    is currently decoding.

    The decoder will call the callback with those bytes that
    constitute one undecodable sequence, even if there is more than
    one undecodable sequence that is undecodable for the same reason
    directly after the first one.  E.g. for the "unicode-escape"
    encoding, when decoding the illegal string "\\u00\\u01x", the
    callback will be called twice (once for "\\u00" and once for
    "\\u01").  This is done to be able to generate the correct number
    of replacement characters.

    The replacement returned from the callback is a unicode object
    that will be emitted by the decoder as-is without further
    processing instead of the undecodable object[start:end] part.

    There is a third API that uses the old strict/ignore/replace error
    handling scheme:

        PyUnicode_TranslateCharmap/unicode.translate

    The proposed patch will enhance PyUnicode_TranslateCharmap, so
    that it also supports the callback registry.  This has the
    additional side effect that PyUnicode_TranslateCharmap will
    support multi-character replacement strings (see SF feature
    request #403100 [1]).

    For PyUnicode_TranslateCharmap the exception class will be named
    UnicodeTranslateError.  PyUnicode_TranslateCharmap will collect
    all consecutive untranslatable characters (i.e. those that map to
    None) and call the callback with them.  The replacement returned
    from the callback is a unicode object that will be put in the
    translated result as-is, without further processing.
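The decoding contract described above can be tried out with a small sketch. It is written for current Python 3 (where the undecodable object is a bytes object, not an 8bit str), and the handler name "example.hexescape" and the function hex_escape_errors are made up for illustration; they are not part of the PEP or the patch.

```python
import codecs


def hex_escape_errors(exc):
    """Hypothetical decode handler: render each undecodable byte as <XX>."""
    if not isinstance(exc, UnicodeDecodeError):
        # Per the specification, a handler that cannot deal with the
        # exception type raises TypeError.
        raise TypeError("can't handle %s" % exc.__class__.__name__)
    bad = exc.object[exc.start:exc.end]           # the undecodable bytes
    replacement = "".join("<%02X>" % b for b in bad)
    # The replacement is emitted as-is; decoding resumes at exc.end.
    return (replacement, exc.end)


# Illustrative name only; any unused string would do.
codecs.register_error("example.hexescape", hex_escape_errors)

decoded = b"ab\xffcd".decode("ascii", "example.hexescape")
print(decoded)
```

The handler never modifies the exception object; it only reads object, start and end and returns the (replacement, newpos) tuple.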
    All encoders and decoders are allowed to implement the callback
    functionality themselves, if they recognize the callback name
    (i.e. if it is a system callback like "strict", "replace" and
    "ignore").  The proposed patch will add two additional system
    callback names: "backslashreplace" and "xmlcharrefreplace", which
    can be used for encoding and translating and which will also be
    implemented in-place for all encoders and
    PyUnicode_TranslateCharmap.

    The Python equivalent of these five callbacks will look like this:

        def strict(exc):
            raise exc

        def ignore(exc):
            if isinstance(exc, UnicodeError):
                return (u"", exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def replace(exc):
            if isinstance(exc, UnicodeEncodeError):
                return ((exc.end-exc.start)*u"?", exc.end)
            elif isinstance(exc, UnicodeDecodeError):
                return (u"\ufffd", exc.end)
            elif isinstance(exc, UnicodeTranslateError):
                return ((exc.end-exc.start)*u"\ufffd", exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def backslashreplace(exc):
            if isinstance(exc,
                (UnicodeEncodeError, UnicodeTranslateError)):
                s = u""
                for c in exc.object[exc.start:exc.end]:
                    if ord(c)<=0xff:
                        s += u"\\x%02x" % ord(c)
                    elif ord(c)<=0xffff:
                        s += u"\\u%04x" % ord(c)
                    else:
                        s += u"\\U%08x" % ord(c)
                return (s, exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

        def xmlcharrefreplace(exc):
            if isinstance(exc,
                (UnicodeEncodeError, UnicodeTranslateError)):
                s = u""
                for c in exc.object[exc.start:exc.end]:
                    s += u"&#%d;" % ord(c)
                return (s, exc.end)
            else:
                raise TypeError("can't handle %s" % exc.__class__.__name__)

    These five callback handlers will also be accessible to Python as
    codecs.strict_error, codecs.ignore_error, codecs.replace_error,
    codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.


Rationale

    Most legacy encodings do not support the full range of Unicode
    characters.  For these cases many high level protocols support a
    way of escaping a Unicode character (e.g. Python itself supports
    the \x, \u and \U convention, XML supports character references
    via &#xxx; etc.).

    When implementing such an encoding algorithm, a problem with the
    current implementation of the encode method of Unicode objects
    becomes apparent: For determining which characters are unencodable
    by a certain encoding, every single character has to be tried,
    because encode does not provide any information about the location
    of the error(s), so

        # (1)
        us = u"xxx"
        s = us.encode(encoding)

    has to be replaced by

        # (2)
        us = u"xxx"
        v = []
        for c in us:
            try:
                v.append(c.encode(encoding))
            except UnicodeError:
                v.append("&#%d;" % ord(c))
        s = "".join(v)

    This slows down encoding dramatically as now the loop through the
    string is done in Python code and no longer in C code.

    Furthermore this solution poses problems with stateful encodings.
    For example UTF-16 uses a Byte Order Mark at the start of the
    encoded byte string to specify the byte order.  Using (2) with
    UTF-16 results in an 8 bit string with a BOM between every
    character.
    To work around this problem, a stream writer - which keeps state
    between calls to the encoding function - has to be used:

        # (3)
        us = u"xxx"
        import codecs, cStringIO as StringIO
        writer = codecs.getwriter(encoding)

        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"&#%d;" % ord(c))
        s = v.getvalue()

    To compare the speed of (1) and (3) the following test script has
    been used:

        # (4)
        import time
        us = u"äa"*1000000
        encoding = "ascii"
        import codecs, cStringIO as StringIO

        t1 = time.time()

        s1 = us.encode(encoding, "replace")

        t2 = time.time()

        writer = codecs.getwriter(encoding)

        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"?")
        s2 = v.getvalue()

        t3 = time.time()

        assert(s1==s2)
        print "1:", t2-t1
        print "2:", t3-t2
        print "factor:", (t3-t2)/(t2-t1)

    On Linux this gives the following output (with Python 2.3a0):

        1: 0.274321913719
        2: 51.1284689903
        factor: 186.381278466

    i.e. (3) is 180 times slower than (1).

    Callbacks must be stateless, because as soon as a callback is
    registered it is available globally and can be called by multiple
    encode() calls.  To be able to use stateful callbacks, the errors
    parameter for encode/decode/translate would have to be changed
    from char * to PyObject *, so that the callback could be used
    directly, without the need to register the callback globally.  As
    this requires changes to lots of C prototypes, this approach was
    rejected.

    Currently all encoding/decoding functions have arguments

        const Py_UNICODE *p, int size

    or

        const char *p, int size

    to specify the unicode characters/8bit characters to be
    encoded/decoded.  So in case of an error the codec has to create a
    new unicode or str object from these parameters and store it in
    the exception object.  The callers of these encoding/decoding
    functions extract these parameters from str/unicode objects
    themselves most of the time, so it could speed up error handling
    if these objects were passed directly.
    As this again requires changes to many C functions, this approach
    has been rejected.


Implementation Notes

    A sample implementation is available as SourceForge patch #432401
    [2].  The current version of this patch differs from the
    specification in the following way:

      * The error information is passed from the codec to the callback
        not as an exception object, but as a tuple, which has an
        additional entry state, which can be used for additional
        information the codec might want to pass to the callback.
      * There are two separate registries (one for
        encoding/translating and one for decoding)

    The class codecs.StreamReaderWriter uses the errors parameter for
    both reading and writing.  To be more flexible this should
    probably be changed to two separate parameters for reading and
    writing.

    The errors parameter of PyUnicode_TranslateCharmap is not
    available to Python, which makes testing of the new functionality
    of PyUnicode_TranslateCharmap impossible with Python scripts.  The
    patch should add an optional argument errors to unicode.translate
    to expose the functionality and make testing possible.

    Codecs that do something different than encoding/decoding from/to
    unicode and want to use the new machinery can define their own
    exception classes and the strict handlers will automatically work
    with it.  The other predefined error handlers are unicode specific
    and expect to get a Unicode(Encode|Decode|Translate)Error
    exception object so they won't work.


Backwards Compatibility

    The semantics of unicode.encode with errors="replace" has changed:
    The old version always stored a ? character in the output string
    even if no character was mapped to ? in the mapping.  With the
    proposed patch, the replacement string from the callback will
    again be looked up in the mapping dictionary.  But as all
    supported encodings are ASCII based, and thus map ? to ?, this
    should not be a problem in practice.


References

    [1] SF feature request #403100
        "Multicharacter replacements in PyUnicode_TranslateCharmap"
        http://www.python.org/sf/403100

    [2] SF patch #432401 "unicode encoding error callbacks"
        http://www.python.org/sf/432401


Copyright

    This document has been placed in the public domain.
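As a hedged illustration of the registry described in the PEP above, here is a minimal sketch of registering and using a custom encoding error handler. It targets current Python 3 rather than the 2.3 of the draft; the handler name "example.codepoint" and the function codepoint_errors are invented for this example.

```python
import codecs


def codepoint_errors(exc):
    """Hypothetical handler: replace unencodable characters by [U+XXXX]."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % exc.__class__.__name__)
    run = exc.object[exc.start:exc.end]           # the unencodable run
    replacement = "".join("[U+%04X]" % ord(c) for c in run)
    # Continue encoding right after the unencodable run.
    return (replacement, exc.end)


# Register under an illustrative name and look it up again.
codecs.register_error("example.codepoint", codepoint_errors)
same_object = codecs.lookup_error("example.codepoint") is codepoint_errors

encoded = "price: 5 \u20ac".encode("ascii", "example.codepoint")
print(encoded)

# Unknown handler names raise LookupError, as the PEP specifies.
try:
    codecs.lookup_error("example.no-such-handler")
    unknown_raises = False
except LookupError:
    unknown_raises = True
```

Because the handler is looked up by name, it works with any codec that supports the callback mechanism, not only with "ascii".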

I'd like to put the following PEP up for pronouncement. Walter is currently on vacation, but he asked me to go ahead with the process already.

    http://www.python.org/peps/pep-0293.html

I like the patch a lot and the implementation strategy is very interesting as well (just wish that classes were new types -- then things could run a tad faster and the patch would be simpler).

The basic idea of the patch is to provide a way to elegantly handle error situations in codecs which go beyond the standard cases 'ignore', 'replace' and 'strict', e.g. to automagically escape problem cases, to log errors for later review or to fetch additional information for the proper handling at coding time (for example, fetching entity definitions from a URL).

Thanks,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/

On Mon, Aug 05, 2002 at 10:12:30AM +0200, M.-A. Lemburg wrote:
Here's another implementation strategy:

Charmap entries can currently be None, an integer or a unicode string. I suggest adding another option: a function or other callable. The function will be called with the input string and current position as arguments and return a 2-tuple of the replacement string and number of characters consumed. This will make it very easy to take the decoding charmap of an existing codec and patch it with a special case for one character like '&' to generate character references, for example.

The function may raise an exception. The error strategy argument will not be overloaded with new functionality - it will just determine whether this exception will be ignored or passed on.

An existing encoding charmap (usually a dictionary) can also be patched for special characters like <,>,&. A special entry with a None key will be the default entry used on a KeyError and will usually be mapped to a function. If no None key is present the charmap will behave exactly the way it does now.

Tying it all together: A codec that does both charmap and entity reference translations may be dynamically generated. A function will be registered that intercepts any codec name that looks like 'xmlcharref.CODECNAME', imports that codec, creates patched charmaps and returns the (enc, dec, reader, writer) tuple. The dynamically created entry will be cached for later use.

Oren

Oren Tirosh wrote:
Even though that's possible, why add more magic to the codec registry ? u.encode('latin-1', 'xmlcharrefreplace') looks much clearer to me. You are of course free to write a codec which implements this directly. No change to the core is needed for that. However, PEP 293 addresses a much wider application space than just escaping unmappable characters. -- Marc-Andre Lemburg
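For reference, the call quoted above can be tried directly in current Python 3, where both "xmlcharrefreplace" and "backslashreplace" ended up as built-in handler names. This is only a small sketch; the sample text is arbitrary.

```python
# Sample text: u-umlaut and sharp s exist in Latin-1, the CJK character does not.
text = "Gr\u00fc\u00dfe \u4e16"

# Characters missing from the target encoding become XML character references.
as_xml = text.encode("latin-1", "xmlcharrefreplace")

# The same mechanism with Python-style backslash escapes and a smaller charset.
as_escapes = text.encode("ascii", "backslashreplace")

print(as_xml)
print(as_escapes)
```

The error handler is chosen independently of the codec, which is the orthogonality the reply argues for.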

Oren Tirosh <oren-py-d@hishome.net> writes:
Charmap entries can currently be None, an integer or a unicode string. I suggest adding another option: a function or other callable.
That helps only for a subset of all codecs (the charmap based ones), and thus is unacceptable. I want it to work for, say, big5 also. Regards, Martin

On Mon, Aug 05, 2002 at 10:08:43PM +0200, Martin v. Loewis wrote:
With the ability to embed functions inside a charmap big5 and other encodings could be converted to be charmap based, too :-) I just feel that there must be *some* simpler way. A patch with 87k of code scares the hell out of me. "There are no complex things. Only things that I haven't yet understood why they are really simple." Oren

Oren Tirosh <oren-py-d@hishome.net> writes:
With the ability to embed functions inside a charmap big5 and other encodings could be converted to be charmap based, too :-)
This is precisely what PEP 293 does: it allows embedding functions in any codec.
I just feel that there must be *some* simpler way.
Why do you think so? It is not difficult.
A patch with 87k of code scares the hell out of me.
Ah, so it is the size of the patch? Some of it could be moved to Python perhaps, thus reducing the size of the patch (e.g. the registry comes to mind).

If you look at the patch, you see that it precisely does what you propose to do: add a callback to the charmap codec:

- it deletes charmap_decoding_error
- it adds state to feed the callback function
- it replaces the old call to charmap_decoding_error by

    !         outpos = p-PyUnicode_AS_UNICODE(v);
    !         startinpos = s-starts;
    !         endinpos = startinpos+1;
    !         if (unicode_decode_call_errorhandler(
    !              errors, &errorHandler,
    !              "charmap", "character maps to <undefined>",
    !              starts, size, &startinpos, &endinpos, &exc, &s,
    !              (PyObject **)&v, &outpos, &p)) {

    # (original code was)
    !         if (charmap_decoding_error(&s, &p, errors,
    !                                    "character maps to <undefined>")) {

- likewise for encoding.

Now, apply the same change to all other codecs (as you propose to do for big5), and you obtain the patch for PEP 293. In doing so, you find that the modifications needed for each codec are so similar that you add some supporting infrastructure, and correct errors in the existing codecs that you spot, and so on. The diffstat is

    Include/codecs.h        |   37
    Include/pyerrors.h      |   67 +
    Lib/codecs.py           |    5
    Modules/_codecsmodule.c |   61 +
    Objects/stringobject.c  |    7
    Objects/unicodeobject.c | 1794 +++++++++++++-------!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Python/codecs.c         |  399 ++++++++++
    Python/exceptions.c     |  603 ++++++++++++++++
    8 files changed, 1678 insertions(+), 236 deletions(-), 1059 modifications(!)

If you look at the large blocks of new code, you find that it is in

- charmap_encoding_error, which insists on implementing known error handling algorithms inline,
- the default error handlers, of which at least PyCodec_XMLCharRefReplaceErrors should be pure-Python,
- PyCodec_BackslashReplaceErrors, likewise,
- the UnicodeError exception methods (which could be omitted, IMO).

So, if you look at the patch, it isn't really that large.

Regards, Martin

On Mon, Aug 05, 2002 at 11:06:25PM +0200, Martin v. Loewis wrote:
But it's NOT an error. It's new encoding functionality. What if the new functionality you've added this way has an error of its own? Perhaps you would like to have a flag to tell it whether to ignore errors or raise an exception? Sorry, that argument has been taken over for another purpose.

The real problem was some missing functionality in codecs. Here are two approaches to solve the problem:

1. Add the missing functionality.

2. Keep the old, limited functionality, let it fail, catch the error, re-use an argument originally intended for an error handling strategy to shoehorn a callback that can implement the missing functionality, add a new name-based registry to overcome the fact that the argument must be a string. Since this approach is conceptually stuck on treating it as an error it actually creates and discards a new exception object for each character converted via this path.

Ummm... <scratches head>, tough choice.

Oren

Oren Tirosh wrote:
Oren, if you just want a codec which encodes and decodes HTML entities, then this can be done easily by writing a codec which works on Unicode only and is stacked on top of the other existing codecs, e.g. if you first encode all non-printable and non-ASCII code points using entity escapes and then pass this Unicode string to one of the other codecs, you have a solution to your problem. Note that this is different from trying to provide a work-around for encoding code points from Unicode for which there are no corresponding mappings in a given encoding. These situations would normally result in an exception. Now HTML and XML offer you the possibility to use special escapes for these, so that you can still encode the complete Unicode set into e.g. ASCII, but only on the premise that the encoded data is HTML or XML text. -- Marc-Andre Lemburg

Oren Tirosh <oren-py-d@hishome.net> writes:
What is not an error? The handling? Certainly: the error and the error handler are different things; error handlers are not errors. "ignore" and "replace" are not errors, either, they are also new encoding functionality. That is the very nature of handlers: they add functionality.
That is not feasible, since you want that functionality also for codecs you haven't heard of.
That is possible, but inefficient. It is also the approach that people use today, and the reason for this PEP to exist. The current UnicodeError does not report any detail on the state that the codec was in.
It's worse: If you find that the entire string cannot be encoded, you typically have two choices:

- you perform a binary search. That may cause log n exceptions.
- you encode every character on its own. That reduces the number of exceptions to the number of unencodable characters, but it will also mean that the encoding is wrong for some encodings: You will always get the shift-in/shift-out sequences that your encoding may specify.

On decoding, this is worse: feeding a byte at a time may fail altogether if you happen to break a multibyte character - when feeding the entire string happily consumes long sequences of characters, and only runs into a single problem byte.

Regards, Martin
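Both pitfalls of the character-at-a-time approach can be reproduced in a few lines. The sketch below uses current Python 3 and picks UTF-16 (the PEP's own example of a stateful encoding) and UTF-8 as stand-ins; the shift-in/shift-out case mentioned above behaves analogously but is not shown.

```python
text = "abc"

# Encoding the whole string emits one byte order mark (BOM) at the start.
whole = text.encode("utf-16")

# Encoding each character on its own emits a BOM per character.
piecewise = b"".join(c.encode("utf-16") for c in text)

bom = whole[:2]
bom_count = piecewise.count(bom)
print(len(whole), len(piecewise), bom_count)

# Decoding byte by byte breaks multibyte sequences altogether.
data = "\u00e9".encode("utf-8")        # two bytes for one character
try:
    pieces = [bytes([b]).decode("utf-8") for b in data]
    split_decodes = True
except UnicodeDecodeError:
    split_decodes = False
print(split_decodes)
```

With a callback-based error handler the codec processes the whole input in one call, so neither problem arises.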

On Tue, Aug 06, 2002 at 10:25:34AM +0200, Martin v. Loewis wrote:
I'm confused. I have just described what PEP 293 is proposing and you say that it's inefficient :-? I find it hard to believe that this is what you really meant since you are presumably in favor of this PEP in its current form. I can't tell if we actually disagree because apparently we don't understand each other.
Instead of treating it as a problem ("the string cannot be encoded") and getting trapped in the mindset of error handling I suggest approaching it from a positive point of view: "how can I make the encoding work the way I want it to work?". Let's leave the error handling for real errors. Treating this as an error-handling issue was so counter-intuitive to me that until recently I never bothered to read PEP 293. The title made me think that it's completely irrelevant to my needs. After all, what I wanted was to translate HTML to/from Unicode, not find a better way to handle errors. Oren

On Tuesday, August 6, 2002, at 11:20 , Oren Tirosh wrote:
I think that this is really also the gist of my misgiving about the design: enhancing a codec/adding extra filtering is a different thing than error handling. The PEP uses "error handling" in the prose, but the API is geared towards adding extra filtering. -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

Jack Jansen wrote:
That's a wrong impression. The new error handling API allows you to do many different things based on the current position of the codec in the input stream. The fact that this can be used to apply escaping to otherwise illegal mappings stems from the basics behind this new API. It's an application, not its main purpose. Filtering can be had using different techniques such as by stacking codecs as well. -- Marc-Andre Lemburg

Oren Tirosh <oren-py-d@hishome.net> writes:
Perhaps I have misunderstood your description. I was assuming an algorithm like

    dispatch = {}   # maps an errors name to an encode function

    def new_encode(str, encoding, errors):
        return dispatch[errors](str, encoding)

    def xml_encode(str, encoding):
        try:
            return str.encode(encoding, "strict")
        except UnicodeError:
            if len(str) == 1:
                return "&#%d;" % ord(str)
            return xml_encode(str[:len(str)/2], encoding) + \
                   xml_encode(str[len(str)/2:], encoding)

    dispatch['xmlcharref'] = xml_encode

This seems to match the description "keep the old, limited functionality, let it fail, catch the error", and it has all the deficiencies I mentioned. It also is not the meaning of PEP 293. The whole idea is that the handler is invoked *before* something has failed.
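To make the contrast concrete, here is a hedged Python 3 port of the binary-search sketch above, with an added counter for the exceptions it has to catch, compared against the "xmlcharrefreplace" handler as it exists today. The counter and the sample string are additions for illustration; the port also assumes an ASCII-compatible target encoding when it builds the character reference.

```python
failures = 0  # number of UnicodeError exceptions caught (illustration only)


def xml_encode(text, encoding):
    """Binary-search fallback: retry on ever smaller halves of the string."""
    global failures
    try:
        return text.encode(encoding, "strict")
    except UnicodeError:
        failures += 1
        if len(text) == 1:
            # Assumes the target encoding is ASCII-compatible.
            return ("&#%d;" % ord(text)).encode("ascii")
        mid = len(text) // 2
        return (xml_encode(text[:mid], encoding) +
                xml_encode(text[mid:], encoding))


sample = "ab\u20accd"
by_search = xml_encode(sample, "ascii")
by_handler = sample.encode("ascii", "xmlcharrefreplace")
print(by_search, by_handler, failures)
```

For this input the search approach raises and catches several exceptions to locate a single unencodable character, while the handler-based call does the same work in one pass inside the codec.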
Sounds good, but how does this help in finding a solution?
If you think this is a documentation issue - I'm fine with documenting the feature differently. Regards, Martin

I know you want me to pronounce on this, but I'd like to abstain. I have no experience in using codecs to have any kind of sense about whether this is good or not. If you feel confident that it's good, you can make the decision on your own. If you're not yet confident, I suggest getting more review. I do note that the patch is humungous (isn't everything related to Unicode? :-) so might need more review before it goes in. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Ok.
Walter has written a pretty good test suite for the patch and I have a good feeling about it. I'd like Walter to check it into CVS and then see whether the alpha tests bring up any quirks. The patch only touches the codecs and adds some new exceptions. There are no other changes involved. I think that together with PEP 263 (source code encoding) this is a great step forward in Python's i18n capabilities. BTW, the test script contains some examples of how to put the error callbacks to use: http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=27815&aid=432401 -- Marc-Andre Lemburg

Sounds like a plan then. --Guido van Rossum (home page: http://www.python.org/~guido/)

I'm back from vacation. Comments on the thread and a list of open issues are below. Guido van Rossum wrote:
Does this mean we can check in the patch? Documentation is still missing and encoding specific decoding tests should be added to the test script. Has anybody except me and Marc-André tried the patch? On anything other than Linux/Intel? With UCS2 and UCS4? Martin v. Loewis wrote:
This is done for performance reasons.
- the default error handlers, of which at least PyCodec_XMLCharRefReplaceErrors should be pure-Python
The PyCodec_XMLCharRefReplaceErrors functionality is independent of the rest, so moving this to Python won't reduce complexity that much. And it will slow down "xmlcharrefreplace" handling for those codecs that don't implement it inline.
- PyCodec_BackslashReplaceErrors, likewise,
- the UnicodeError exception methods (which could be omitted, IMO).
Those methods were implemented so that we can easily move to new style exceptions. The exception attributes can then be members of the C struct and the accessor functions can be simple macros. I guess some of the methods could be removed by moving duplicate ones to the base class UnicodeError, but this would break backwards compatibility. Oren Tirosh wrote:
The registry is name-based because this is required by the current C API. Passing the error handler directly as a function object would be simpler, but this can't be done, as it would require vast changes to the C API (an old version of the patch did that.) And this way we gain the benefit of implementing well-known error handling names inline. It is "yet another" registry exactly because encoding and error handling are completely orthogonal (at least for encoding). If you add a new error handler all codecs can use it (as long as they are aware of the new error handling way) and if you define a new codec it will work with all existing error handlers.
Generating an exception for each character that isn't handled by simple lookup probably adds quite a lot of overhead.
1. All encoders try to collect runs of unencodable characters to minimize the number of calls to the callback.

2. The PEP explicitly states that the codec is allowed to reuse the exception object. All codecs do this, so the exception object will only be created once (at most; when no error occurs, no exception object will be created).

The exception object is just a quick way to pass information between the codec and the error handler and it could become even faster as soon as we get new style exceptions.
Not all codecs are charmap based.

Open issues:

1. For each error handler two Python function objects are created: One in the registry and a different one in the codecs module. This means that e.g.

       codecs.lookup_error("replace") != codecs.replace_errors

   We can fix that by making the name of the Python function object globally visible or by changing the codecs init function to do a lookup and use the result or simply by removing codecs.replace_errors.

2. Currently charmap encoding uses a safe way for reallocating string storage, which tests available space on each output. This slows charmap encoding down a bit. This should probably be changed back to the old way: Test available space only for output strings longer than one character.

3. Error reporting logic in the exception attribute setters/getters may be non-standard. What is the standard way to report errors for C functions that don't return object pointers? Is it ==0 for error and !=0 for success, or ==0 for success and !=0 for error? PyArg_ParseTuple returns true on success, PyObject_SetAttr returns true on failure; which one is the exception and which one the rule?

4. Assigning to an attribute of an exception object does not change the appropriate entry in the args attribute. Is this worth changing?

5. UTF-7 decoding does not yet take full advantage of the machinery: When an unterminated shift sequence is encountered (e.g. "+xxx") the faulty byte sequence has already been emitted.

Bye,
   Walter Dörwald
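The point made earlier in this message about collecting runs of unencodable characters can be observed with a recording handler. This is a sketch for current Python 3; the handler name "example.recording" is invented, and since the PEP only recommends run collection, a given codec may legitimately call the handler once per character instead.

```python
import codecs

calls = []  # (start, end) pairs seen by the handler


def recording_errors(exc):
    """Hypothetical handler: note each reported range, emit one '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % exc.__class__.__name__)
    calls.append((exc.start, exc.end))
    return ("?", exc.end)


codecs.register_error("example.recording", recording_errors)

# Three unencodable characters in a row, then a single one later on.
sample = "a\u00e4\u00f6\u00fcb\u00e9"
result = sample.encode("ascii", "example.recording")

print(calls)    # one entry per call to the handler
print(result)
```

If the codec collects runs, the handler is called once per run rather than once per character, which keeps the number of Python-level calls small.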

Walter Dörwald wrote:
I'm back from vacation. Comments on the thread and a list of open issues are below.
I'm going on vacation for two weeks, so you'll have to take it along from here. Have fun, -- Marc-Andre Lemburg

Walter Dörwald <walter@livinglogic.de> writes:
Is that really worth it? Such errors are rare, and when they occur, they usually cause an exception as the result of the "strict" error handling. I'd strongly encourage you to avoid duplication of code, and use Python wherever possible.
Sure it will. But how much does that matter in the overall context of generating HTML/XML?
What are new-style exceptions?
The exception attributes can then be members of the C struct and the accessor functions can be simple macros.
Again, I sense premature optimization.
Why would this be a problem?
I recommend to fix this by implementing the registry in Python.
No. Exception objects should be treated as immutable (even if they aren't). If somebody complains, we can fix it; until then, it suffices if this is documented.
It would be ok if it works as well as it did in 2.2. UTF-7 is rarely used; if it is used, it is machine-generated, so there shouldn't be any errors.

Regards, Martin

Martin v. Loewis wrote:
See below: this is not always possible; much for the same reason that exceptions are implemented in C as well.
Exceptions that are built as subclassable types.
There's nothing premature here. By moving exception handling to C level, you get *much* better performance than at Python level. Remember that applications like e.g. escaping chars in an XML document can cause lots of these exceptions to be generated.
This doesn't work as I've already explained before. The predefined error handling modes of builtin codecs must work with relying on the Python import mechanism.
What? That exceptions are immutable? I think it's a big win that exceptions are in fact mutable -- they are great for transporting extra information up the chain...

    try:
        ...
    except Exception, obj:
        obj.been_there = 1
        raise
Right.

-- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
What are new-style exceptions?
Exceptions that are built as subclassable types.
Exceptions first of all inherit from Exception. When/if Exception stops being a class, we'll have to deal with more issues than the PEP 293 exceptions.
There's nothing premature here. By moving exception handling to C level, you get *much* better performance than at Python level.
Can you give a specific example: What Python code, how much better performance?
You mean "without"? Where did you explain this before? And why is that? Guido argues that more of the central interpreter machinery must be moved to Python - I can't see why codecs should be an exception here.
I see. So this is an open issue. Regards, Martin

Martin v. Loewis wrote:
Right. It would be nice to have classes or at least exceptions turn into new-style types as well. Then you'd have access to slots and all the other goodies which make a great difference in accessing performance at C level.
Walter has the details here.
Right. s/with/without/.
Where did you explain this before?
Hmm, I remember having posted the reasoning I gave here in another response on this thread, but I can't find it at the moment.
The problem is the same as what we had with the exceptions.py module early on in the 1.6 alphas: if this module isn't found all kinds of things start failing. The same would happen when you start to use builtin codecs which have external error handler implementation as .py files, e.g. unicode('utf-8', 'replace') could then fail because of an ImportError. For the charmap codec it's mostly about performance. I don't have objections for other codecs which rely on external resources.
I wouldn't call it an issue. It's a feature :-) (and one that makes Python's exception mechanism very powerful)

-- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
What kinds of things would start failing? If you get an interactive prompt (i.e. Python still manages to start up), or you get a traceback indicating the problem in non-interactive mode, I don't see this as a problem - *of course* Python will stop working if you remove essential files. This is like saying you expect the interpreter to continue to work after you remove python23.dll. So, if your worry is that things would not work if you remove a Python file - don't worry. Python already relies on Python files being present in various places.
Please remember that we are still about error handling here, and that the normal case will be "strict", which usually results in aborting the computation. So I don't see the performance issue even for the charmap codec. Regards, Martin

Martin v. Loewis wrote:
Yes, but I think this is not that much of a problem, because when the code that catches the exception wants to do something with exc.args it has to know what the entries mean, which depends on the type. And if this code knows that it is dealing with a UnicodeEncodeError it can simply use exc.start instead of exc.args[2]. Bye, Walter Dörwald
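The correspondence between the named attributes and the entries of args can be illustrated with a small sketch. It assumes a modern Python 3 interpreter rather than the 2.3 code base discussed in this thread, and the sample string is invented for illustration.

```python
# Sketch only: assumes Python 3; the sample data is made up.
exc = UnicodeEncodeError("ascii", "a\xe4b", 1, 2, "ordinal not in range(128)")

# The named attributes mirror the positional constructor arguments,
# so a handler can use exc.start instead of exc.args[2].
by_name = (exc.encoding, exc.object, exc.start, exc.end, exc.reason)
by_index = tuple(exc.args)

print(by_name == by_index)
```

Code that already knows it is handling a UnicodeEncodeError can therefore rely on the names alone, which is the point made above.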

Martin v. Loewis wrote:
Of course it's irrelevant how fast the exception is raised, but it could be important for the handlers that really do a replacement.
See the attached test script. It encodes 100 versions of the German text on http://www.gutenberg2000.de/grimm/maerchen/tapfere.htm

Output is as follows:

    1790000 chars, 2.330% unenc
    ignore: 0.022 (factor=1.000)
    xmlcharrefreplace: 0.044 (factor=1.962)
    xml2: 0.267 (factor=12.003)
    xml3: 0.723 (factor=32.506)
    workaround: 5.151 (factor=231.702)

i.e. a 1.7MB string with 2.3% unencodable characters was encoded. Using the inline xmlcharrefreplace instead of ignore is half as fast. Using a callback instead of the inline implementation is a factor of 12 slower than ignore. Using the Python implementation of the callback is a factor of 32 slower and using the pre-PEP workaround is a factor of 231 slower.

Replacing every unencodable character with u"\u4242" and using "iso-8859-15" gives:

    ignore: 0.351 (factor=1.000)
    xmlcharrefreplace: 0.390 (factor=1.113)
    xml2: 0.653 (factor=1.862)
    xml3: 1.137 (factor=3.244)
    workaround: 12.310 (factor=35.117)
No it's more like anticipating change.
It's just unintuitive.
Even simpler would be to move the initialization of the module variables from Modules/_codecsmodule.c to Lib/codecs.py. There is no need for them to be available in _codecs. All that is required for this change is to add

    strict_errors = lookup_error("strict")
    ignore_errors = lookup_error("ignore")
    replace_errors = lookup_error("replace")
    xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
    backslashreplace_errors = lookup_error("backslashreplace")

to codecs.py.

The registry should be available via two simple C APIs, just like the encoding registry.
The codecs in the PEP *do* modify attributes of the exception object.
If somebody complains, we can fix it; until then, it suffices if this is documented.
It can't really be fixed for codecs implemented in Python. For codecs that use the C functions we could add the functionality that e.g. PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3], but AFAICT it can't be done easily for Python where attribute assignment directly goes to the instance dict. If those exception classes were new style classes it would be simple, because the attributes would be properties and args would probably be generated lazily.
It does:

    >>> "+xxx".decode("utf-7", "replace")
    u'\uc71c\ufffd'

although the result should probably have been u'\ufffd'.

Bye, Walter Dörwald

    import codecs, time

    def xml3(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"".join([u"&#%d;" % ord(c) for c in exc.object[exc.start:exc.end]]), exc.end)
        else:
            raise TypeError("don't know how to handle %r" % exc)

    count = 0

    def check(exc):
        global count
        count += exc.end-exc.start
        return (u"", exc.end)

    codecs.register_error("xmlcheck", check)
    codecs.register_error("xml2", codecs.xmlcharrefreplace_errors)
    codecs.register_error("xml3", xml3)

    l = 100
    s = unicode(open("tapferschneider.txt").read(), "latin-1")
    s *= l

    s.encode("ascii", "xmlcheck")
    print "%d chars, %.03f%% unenc" % (len(s), 100.*(float(count)/len(s)))

    handlers = ["ignore", "xmlcharrefreplace", "xml2", "xml3"]
    times = [0]*(len(handlers)+1)
    res = [0]*(len(handlers)+1)
    for (i, h) in enumerate(handlers):
        t1 = time.time()
        res[i] = s.encode("ascii", h)
        t2 = time.time()
        times[i] = t2-t1
        print "%s: %.03f (factor=%.03f)" % (handlers[i], times[i], times[i]/times[0])

    i = len(handlers)
    t1 = time.time()
    v = []
    for c in s:
        try:
            v.append(c.encode("ascii"))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    res[i] = "".join(v)
    t2 = time.time()
    times[i] = t2-t1
    print "workaround: %.03f (factor=%.03f)" % (times[i], times[i]/times[0])
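For readers on a current interpreter, the xml3 handler from the script above might look roughly like this in Python 3. This is a sketch under assumptions, not the code that was benchmarked: the handler name "xml_sketch" and the sample text are invented, and no timings are implied.

```python
import codecs

def xml_sketch(exc):
    """Replace each unencodable character with a numeric character reference."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("&#%d;" % ord(c) for c in bad)
        # Return the replacement and the position at which encoding resumes.
        return (replacement, exc.end)
    raise TypeError("don't know how to handle %r" % exc)

# "xml_sketch" is a hypothetical name chosen for this example.
codecs.register_error("xml_sketch", xml_sketch)

text = "Gr\xfc\xdfe"  # invented sample containing two non-ASCII characters
custom = text.encode("ascii", "xml_sketch")
builtin = text.encode("ascii", "xmlcharrefreplace")
print(custom, builtin)
```

Both calls are expected to produce the same bytes; the difference measured in the thread is only how fast each path runs.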

Walter Dörwald <walter@livinglogic.de> writes:
Those numbers are impressive. Can you please add

    def xml4(exc):
        if isinstance(exc, UnicodeEncodeError):
            if exc.end-exc.start == 1:
                return u"&#"+str(ord(exc.object[exc.start]))+u";"
            else:
                r = []
                for c in exc.object[exc.start:exc.end]:
                    r.extend([u"&#", str(ord(c)), u";"])
                return u"".join(r)
        else:
            raise TypeError("don't know how to handle %r" % exc)

and report how that performs (assuming I made no error)?
Using a callback instead of the inline implementation is a factor of 12 slower than ignore.
For the purpose of comparing C and Python, this isn't relevant, is it? Only the C version of xmlcharrefreplace and a Python version should be compared.
You could add methods into the class (set_reason etc.), which error handler authors would have to use. Again, these methods could be added through Python code, so no C code would be necessary to implement them. You could even implement a setattr method in Python - although you'd have to search this from C while initializing the class.

Regards, Martin

Martin v. Loewis wrote:
You must return a tuple (replacement, new input position); otherwise the code is correct. I tried it and two new versions:

    def xml5(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
        else:
            raise TypeError("don't know how to handle %r" % exc)

    def xml6(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"&#" + str(ord(exc.object[exc.start])) + u";", exc.start+1)
        else:
            raise TypeError("don't know how to handle %r" % exc)

Here are the results:

    1790000 chars, 2.330% unenc
    ignore: 0.022 (factor=1.000)
    xmlcharrefreplace: 0.042 (factor=1.935)
    xml2: 0.264 (factor=12.084)
    xml3: 0.733 (factor=33.529)
    xml4: 0.504 (factor=23.057)
    xml5: 0.474 (factor=21.649)
    xml6: 0.481 (factor=22.010)
    workaround: 5.138 (factor=234.862)
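The return contract mentioned above (a tuple of replacement and new input position) can be checked with a short sketch. It assumes Python 3, where a handler that returns anything other than such a tuple is rejected with a TypeError; both handler names are hypothetical.

```python
import codecs

def returns_tuple(exc):
    # Handle a single character and resume right after it.
    return ("&#%d;" % ord(exc.object[exc.start]), exc.start + 1)

def returns_string_only(exc):
    # Deliberately wrong: no resume position is given.
    return "&#%d;" % ord(exc.object[exc.start])

# Hypothetical handler names, registered only for this example.
codecs.register_error("sketch_tuple", returns_tuple)
codecs.register_error("sketch_string_only", returns_string_only)

ok = "x\u20acy".encode("ascii", "sketch_tuple")

try:
    "x\u20acy".encode("ascii", "sketch_string_only")
    rejected = False
except TypeError:
    # The codec refuses a handler result that is not a 2-tuple.
    rejected = True

print(ok, rejected)
```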
I was just too lazy to code this. ;) Python is a factor of 2.7 slower than the C callback (or 1.9 for your version).
For me this sounds much more complicated than the current C functions, especially for using them from C, which most codecs probably will. Bye, Walter Dörwald

Having to register the error handler first and then finding it by name smells like a very big hack to me. I understand the reasoning (that you don't want to modify the API of a gazillion C routines to add an error object argument) but it still seems like a hack.... -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

Jack Jansen wrote:
Well, in that case, you would have to call the whole codec registry a hack ;-) I find having the callback available by an alias name very user friendly, but YMMV. The main reason behind this way of doing it is to maintain C API compatibility without adding a complete new backward compatibility layer (Walter started out this way; see the SF patch page).

-- Marc-Andre Lemburg

On maandag, augustus 5, 2002, at 05:47 , M.-A. Lemburg wrote:
No, not really. For codecs I think that there needn't be much of a connection between the codec-supplier and the codec-user. Conceivably the encoding-identifying string being passed to encode() could even have been read from a data file or something. For error handling this is silly: the code calling encode() or decode() will know how it wants errors handled. And if you argue that it isn't really error handling but an extension to the encoding name then maybe it should be treated as such (by appending it to the codec name in the string, as in "ascii;xmlentitydefs" or so?).

-- Jack Jansen

Jack Jansen wrote:
You are omitting the fact, though, that different codecs may need different implementations of a specific error handler. Now the error handler will always implement the same logic, so to the users it's all the same thing. And by using the string alias he needn't worry about where to get the error handler from (it typically lives with the codec itself).

Note that error handling is not really an extension to the encoding itself. It just happens that it can be put to use that way for e.g. escaping non-representable characters. Other applications like fetching extra information from external sources or logging the positions of coding problems do not fall into this category.

-- Marc-Andre Lemburg
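The "logging the positions of coding problems" use case could be sketched as follows. This assumes Python 3; the handler name "sketch_log", the list problem_spans and the sample string are all invented for illustration.

```python
import codecs

problem_spans = []  # (start, end) pairs collected while encoding

def log_and_replace(exc):
    if isinstance(exc, UnicodeEncodeError):
        # Record where the problem occurred, then substitute question marks.
        problem_spans.append((exc.start, exc.end))
        return ("?" * (exc.end - exc.start), exc.end)
    raise exc

# Hypothetical handler name for this example.
codecs.register_error("sketch_log", log_and_replace)

encoded = "a\xe4b\xf6".encode("ascii", "sketch_log")
print(encoded, problem_spans)
```

Nothing here extends the encoding itself; the handler only observes the error positions and picks a fallback.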

I'd like to put the following PEP up for pronouncement. Walter is currently on vacation, but he asked me to already go ahead with the process.

http://www.python.org/peps/pep-0293.html

I like the patch a lot and the implementation strategy is very interesting as well (just wish that classes were new types -- then things could run a tad faster and the patch would be simpler). The basic idea of the patch is to provide a way to elegantly handle error situations in codecs which go beyond the standard cases 'ignore', 'replace' and 'strict', e.g. to automagically escape problem cases, to log errors for later review or to fetch additional information for the proper handling at coding time (for example, fetching entity definitions from a URL).

Thanks,
-- Marc-Andre Lemburg

On Mon, Aug 05, 2002 at 10:12:30AM +0200, M.-A. Lemburg wrote:
Here's another implementation strategy: Charmap entries can currently be None, an integer or a unicode string. I suggest adding another option: a function or other callable. The function will be called with the input string and current position as arguments and return a 2-tuple of the replacement string and number of characters consumed. This will make it very easy to take the decoding charmap of an existing codec and patch it with a special-case for one character like '&' to generate character references, for example. The function may raise an exception. The error strategy argument will not be overloaded with new functionality - it will just determine whether this exception will be ignored or passed on. An existing encoding charmap (usually a dictionary) can also be patched for special characters like <,>,&. A special entry with a None key will be the default entry used on a KeyError and will usually be mapped to a function. If no None key is present the charmap will behave exactly the way it does now. Tying it all together: A codec that does both charmap and entity reference translations may be dynamically generated. A function will be registered that intercepts any codec name that looks like 'xmlcharref.CODECNAME', import that codec, create patched charmaps and return the (enc, dec, reader, writer) tuple. The dynamically created entry will be cached for later use. Oren

Oren Tirosh wrote:
Even though that's possible, why add more magic to the codec registry? u.encode('latin-1', 'xmlcharrefreplace') looks much clearer to me. You are of course free to write a codec which implements this directly. No change to the core is needed for that. However, PEP 293 addresses a much wider application space than just escaping unmappable characters.

-- Marc-Andre Lemburg

Oren Tirosh <oren-py-d@hishome.net> writes:
Charmap entries can currently be None, an integer or a unicode string. I suggest adding another option: a function or other callable.
That helps only for a subset of all codecs (the charmap based ones), and thus is unacceptable. I want it to work for, say, big5 also. Regards, Martin

On Mon, Aug 05, 2002 at 10:08:43PM +0200, Martin v. Loewis wrote:
With the ability to embed functions inside a charmap big5 and other encodings could be converted to be charmap based, too :-) I just feel that there must be *some* simpler way. A patch with 87k of code scares the hell out of me. "There are no complex things. Only things that I haven't yet understood why they are really simple." Oren

Oren Tirosh <oren-py-d@hishome.net> writes:
With the ability to embed functions inside a charmap big5 and other encodings could be converted to be charmap based, too :-)
This is precisely what PEP 293 does: allow to embed functions in any codec.
I just feel that there must be *some* simpler way.
Why do you think so? It is not difficult.
A patch with 87k of code scares the hell out of me.
Ah, so it is the size of the patch? Some of it could be moved to Python perhaps, thus reducing the size of the patch (e.g. the registry comes to mind).

If you look at the patch, you see that it precisely does what you propose to do: add a callback to the charmap codec:

- it deletes charmap_decoding_error
- it adds state to feed the callback function
- it replaces the old call to charmap_decoding_error by

      ! outpos = p-PyUnicode_AS_UNICODE(v);
      ! startinpos = s-starts;
      ! endinpos = startinpos+1;
      ! if (unicode_decode_call_errorhandler(
      !      errors, &errorHandler,
      !      "charmap", "character maps to <undefined>",
      !      starts, size, &startinpos, &endinpos, &exc, &s,
      !      (PyObject **)&v, &outpos, &p)) {

  (original code was)

      ! if (charmap_decoding_error(&s, &p, errors,
      !                            "character maps to <undefined>")) {

- likewise for encoding.

Now, apply the same change to all other codecs (as you propose to do for big5), and you obtain the patch for PEP 293. In doing so, you find that the modifications needed for each codec are so similar that you add some supporting infrastructure, and correct errors in the existing codecs that you spot, and so on.

The diffstat is

    Include/codecs.h        |   37
    Include/pyerrors.h      |   67 +
    Lib/codecs.py           |    5
    Modules/_codecsmodule.c |   61 +
    Objects/stringobject.c  |    7
    Objects/unicodeobject.c | 1794 +++++++++++++-------!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Python/codecs.c         |  399 ++++++++++
    Python/exceptions.c     |  603 ++++++++++++++++
    8 files changed, 1678 insertions(+), 236 deletions(-), 1059 modifications(!)

If you look at the large blocks of new code, you find that it is in

- charmap_encoding_error, which insists on implementing known error handling algorithms inline,
- the default error handlers, of which at least PyCodec_XMLCharRefReplaceErrors should be pure-Python,
- PyCodec_BackslashReplaceErrors, likewise,
- the UnicodeError exception methods (which could be omitted, IMO).

So, if you look at the patch, it isn't really that large.

Regards, Martin

On Mon, Aug 05, 2002 at 11:06:25PM +0200, Martin v. Loewis wrote:
But it's NOT an error. It's new encoding functionality. What if the new functionality you've added this way has an error of its own? Perhaps you would like to have a flag to tell it whether to ignore error or raise an exception? Sorry, that argument has been taken over for another purpose.

The real problem was some missing functionality in codecs. Here are two approaches to solve the problem:

1. Add the missing functionality.

2. Keep the old, limited functionality, let it fail, catch the error, re-use an argument originally intended for an error handling strategy to shoehorn a callback that can implement the missing functionality, add a new name-based registry to overcome the fact that the argument must be a string. Since this approach is conceptually stuck on treating it as an error it actually creates and discards a new exception object for each character converted via this path.

Ummm... <scratches head>, tough choice.

Oren

Oren Tirosh wrote:
Oren, if you just want a codec which encodes and decodes HTML entities, then this can be done easily by writing a codec which works on Unicode only and is stacked on top of the other existing codecs, e.g. if you first encode all non-printable and non-ASCII code points using entity escapes and then pass this Unicode string to one of the other codecs, you have a solution to your problem.

Note that this is different from trying to provide a work-around for encoding code points from Unicode for which there are no corresponding mappings in a given encoding. These situations would normally result in an exception. Now HTML and XML offer you the possibility to use special escapes for these, so that you can still encode the complete Unicode set into e.g. ASCII, but only on the premise that the encoded data is HTML or XML text.

-- Marc-Andre Lemburg

Oren Tirosh <oren-py-d@hishome.net> writes:
What is not an error? The handling? Certainly: the error and the error handler are different things; error handlers are not errors. "ignore" and "replace" are not errors, either, they are also new encoding functionality. That is the very nature of handlers: they add functionality.
That is not feasible, since you want that functionality also for codecs you haven't heard of.
That is possible, but inefficient. It is also the approach that people use today, and the reason for this PEP to exist. The current UnicodeError does not report any detail on the state that the codec was in.
It's worse: if you find that the entire string cannot be encoded, you typically have two choices:

- You perform a binary search. That may cause log n exceptions.
- You encode every character on its own. That reduces the number of exceptions to the number of unencodable characters, but it will also mean that the encoding is wrong for some encodings: you will always get the shift-in/shift-out sequences that your encoding may specify.

On decoding, this is worse: feeding a byte at a time may fail altogether if you happen to break a multibyte character, whereas feeding the entire string happily consumes long sequences of characters and only runs into a single problem byte.

Regards, Martin

On Tue, Aug 06, 2002 at 10:25:34AM +0200, Martin v. Loewis wrote:
I'm confused. I have just described what PEP 293 is proposing and you say that it's inefficient :-? I find it hard to believe that this is what you really meant since you are presumably in favor of this PEP in its current form. I can't tell if we actually disagree because apparently we don't understand each other.
Instead of treating it as a problem ("the string cannot be encoded") and getting trapped in the mindset of error handling I suggest approaching it from a positive point of view: "how can I make the encoding work the way I want it to work?". Let's leave the error handling for real errors. Treating this as an error-handling issue was so counter-intuitive to me that until recently I never bothered to read PEP 293. The title made me think that it's completely irrelevant to my needs. After all, what I wanted was to translate HTML to/from Unicode, not find a better way to handle errors. Oren

On Tuesday, August 6, 2002, at 11:20 , Oren Tirosh wrote:
I think that this is really also the gist of my misgiving about the design: enhancing a codec/adding extra filtering is a different thing than error handling. The PEP uses "error handling" in the prose, but the API is geared towards adding extra filtering.

-- Jack Jansen

Jack Jansen wrote:
That's a wrong impression. The new error handling API allows you to do many different things based on the current position of the codec in the input stream. The fact that this can be used to apply escaping to otherwise illegal mappings stems from the basics behind this new API. It's an application, not its main purpose. Filtering can be had using different techniques such as by stacking codecs as well.

-- Marc-Andre Lemburg

Oren Tirosh <oren-py-d@hishome.net> writes:
Perhaps I have misunderstood your description. I was assuming an algorithm like

    def new_encode(str, encoding, errors):
        return dispatch[errors](str, encoding)

    def xml_encode(str, encoding):
        try:
            return str.encode(encoding, "strict")
        except UnicodeError:
            if len(str) == 1:
                return "&#%d;" % ord(str)
            return xml_encode(str[:len(str)/2], encoding) + \
                   xml_encode(str[len(str)/2:], encoding)

    dispatch['xmlcharref'] = xml_encode

This seems to match the description "keep the old, limited functionality, let it fail, catch the error", and it has all the deficiencies I mentioned. It also is not the meaning of PEP 293. The whole idea is that the handler is invoked *before* something has failed.
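To make the contrast concrete, here is a rough Python 3 rendering of the split-and-retry workaround next to the handler-based call. It is a sketch under assumptions: the function name split_encode and the sample text are invented, and only a stateless encoding (ASCII) is exercised, so the stateful-encoding problems raised earlier in the thread are not shown.

```python
def split_encode(text, encoding):
    """Workaround style: try strictly, and on failure split the input and retry."""
    try:
        return text.encode(encoding, "strict")
    except UnicodeError:
        if len(text) == 1:
            # A single unencodable character becomes a character reference.
            return b"&#%d;" % ord(text)
        mid = len(text) // 2
        return split_encode(text[:mid], encoding) + split_encode(text[mid:], encoding)

sample = "na\xefve caf\xe9 \u20ac"  # invented sample text

via_split = split_encode(sample, "ascii")
# Handler style: the codec calls the handler at the error position and goes on.
via_handler = sample.encode("ascii", "xmlcharrefreplace")
print(via_split == via_handler)
```

For ASCII both routes should yield identical bytes; the workaround simply raises and catches several exceptions and re-encodes the same text repeatedly, while the handler route makes a single pass.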
Sounds good, but how does this help in finding a solution?
If you think this is a documentation issue - I'm fine with documenting the feature differently. Regards, Martin

I know you want me to pronounce on this, but I'd like to abstain. I have no experience in using codecs to have any kind of sense about whether this is good or not. If you feel confident that it's good, you can make the decision on your own. If you're not yet confident, I suggest getting more review. I do note that the patch is humungous (isn't everything related to Unicode? :-) so might need more review before it goes in.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Ok.
Walter has written a pretty good test suite for the patch and I have a good feeling about it. I'd like Walter to check it into CVS and then see whether the alpha tests bring up any quirks. The patch only touches the codecs and adds some new exceptions. There are no other changes involved. I think that together with PEP 263 (source code encoding) this is a great step forward in Python's i18n capabilities.

BTW, the test script contains some examples of how to put the error callbacks to use: http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=27815&aid=432401

-- Marc-Andre Lemburg

Sounds like a plan then. --Guido van Rossum (home page: http://www.python.org/~guido/)

I'm back from vacation. Comments on the thread and a list of open issues are below. Guido van Rossum wrote:
Does this mean we can check in the patch? Documentation is still missing and encoding specific decoding tests should be added to the test script. Has anybody except me and Marc-André tried the patch? On anything other than Linux/Intel? With UCS2 and UCS4? Martin v. Loewis wrote:
This is done for performance reasons.
- the default error handlers, of which at least PyCodec_XMLCharRefReplaceErrors should be pure-Python
The PyCodec_XMLCharRefReplaceErrors functionality is independent of the rest, so moving this to Python won't reduce complexity that much. And it will slow down "xmlcharrefreplace" handling for those codecs that don't implement it inline.
- PyCodec_BackslashReplaceErrors, likewise,
- the UnicodeError exception methods (which could be omitted, IMO).
Those methods were implemented so that we can easily move to new style exceptions. The exception attributes can then be members of the C struct and the accessor functions can be simple macros. I guess some of the methods could be removed by moving duplicate ones to the base class UnicodeError, but this would break backwards compatibility. Oren Tirosh wrote:
The registry is name-based because this is required by the current C API. Passing the error handler directly as a function object would be simpler, but this can't be done, as it would require vast changes to the C API (an old version of the patch did that). And this way we gain the benefit of implementing well-known error handling names inline. It is "yet another" registry exactly because encoding and error handling are completely orthogonal (at least for encoding). If you add a new error handler all codecs can use it (as long as they are aware of the new error handling way) and if you define a new codec it will work with all existing error handlers.
Generating an exception for each character that isn't handled by simple lookup probably adds quite a lot of overhead.
1. All encoders try to collect runs of unencodable characters to minimize the number of calls to the callback.

2. The PEP explicitly states that the codec is allowed to reuse the exception object. All codecs do this, so the exception object will only be created once (at most; when no error occurs, no exception object will be created).

The exception object is just a quick way to pass information between the codec and the error handler and it could become even faster as soon as we get new style exceptions.
Not all codecs are charmap based.

Open issues:

1. For each error handler two Python function objects are created: one in the registry and a different one in the codecs module. This means that e.g.

       codecs.lookup_error("replace") != codecs.replace_errors

   We can fix that by making the name of the Python function object globally visible, or by changing the codecs init function to do a lookup and use the result, or simply by removing codecs.replace_errors.

2. Currently charmap encoding uses a safe way of reallocating string storage, which tests available space on each output. This slows charmap encoding down a bit. This should probably be changed back to the old way: test available space only for output strings longer than one character.

3. Error reporting logic in the exception attribute setters/getters may be non-standard. What is the standard way to report errors for C functions that don't return object pointers? ==0 for error and !=0 for success, or ==0 for success and !=0 for error? PyArg_ParseTuple returns true on success, PyObject_SetAttr returns true on failure; which one is the exception and which one the rule?

4. Assigning to an attribute of an exception object does not change the appropriate entry in the args attribute. Is this worth changing?

5. UTF-7 decoding does not yet take full advantage of the machinery: when an unterminated shift sequence is encountered (e.g. "+xxx") the faulty byte sequence has already been emitted.

Bye, Walter Dörwald

Walter Dörwald wrote:
I'm back from vacation. Comments on the thread and a list of open issues are below.
I'm going on vacation for two weeks, so you'll have to take it along from here. Have fun, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

Walter Dörwald <walter@livinglogic.de> writes:
Is that really worth it? Such errors are rare, and when they occur, they usually cause an exception as the result of the "strict" error handling. I'd strongly encourage you to avoid duplication of code, and use Python wherever possible.
Sure it will. But how much does that matter in the overall context of generating HTML/XML?
What are new-style exceptions?
The exception attributes can then be members of the C struct and the accessor functions can be simple macros.
Again, I sense premature optimization.
Why would this be a problem?
I recommend to fix this by implementing the registry in Python.
No. Exception objects should be treated as immutable (even if they aren't). If somebody complains, we can fix it; until then, it suffices if this is documented.
It would be ok if it works as well as it did in 2.2. UTF-7 is rarely used; if it is used, it is machine-generated, so there shouldn't be any errors. Regards, Martin

Martin v. Loewis wrote:
See below: this is not always possible; much for the same reason that exceptions are implemented in C as well.
Exceptions that are built as subclassable types.
There's nothing premature here. By moving exception handling to C level, you get *much* better performance than at Python level. Remember that applications like e.g. escaping chars in an XML document can cause lots of these exceptions to be generated.
This doesn't work as I've already explained before. The predefined error handling modes of builtin codecs must work with relying on the Python import mechanism.
What? That exceptions are immutable? I think it's a big win that exceptions are in fact mutable -- they are great for transporting extra information up the chain...

    try:
        ...
    except Exception, obj:
        obj.been_there = 1
        raise
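For readers on current Python, the same idiom can be sketched in Python 3 syntax. ParseError, low_level and middle_layer are hypothetical names used only for this illustration.

```python
class ParseError(Exception):
    """Hypothetical exception type used only for this illustration."""

def low_level():
    raise ParseError("bad input")

def middle_layer():
    try:
        low_level()
    except Exception as obj:
        obj.been_there = 1  # attach extra information to the exception
        raise               # re-raise the very same object

try:
    middle_layer()
except ParseError as exc:
    # The attribute set further down the call chain is visible here.
    seen = getattr(exc, "been_there", None)
    print(seen)
```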
Right. -- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
What are new-style exceptions?
Exceptions that are built as subclassable types.
Exceptions first of all inherit from Exception. When/if Exception stops being a class, we'll have to deal with more issues than the PEP 293 exceptions.
There's nothing premature here. By moving exception handling to C level, you get *much* better performance than at Python level.
Can you give a specific example: What Python code, how much better performance?
You mean "without"? Where did you explain this before? And why is that? Guido argues that more of the central interpreter machinery must be moved to Python - I can't see why codecs should be an exception here.
I see. So this is an open issue. Regards, Martin

Martin v. Loewis wrote:
Right. It would be nice to have classes or at least exceptions turn into new-style types as well. Then you'd have access to slots and all the other goodies which make a great difference in accessing performance at C level.
Walter has the details here.
Right. s/with/without/.
Where did you explain this before?
Hmm, I remember having posted the reasoning I gave here in another response on this thread, but I can't find it at the moment.
The problem is the same as what we had with the exceptions.py module early on in the 1.6 alphas: if this module isn't found all kinds of things start failing. The same would happen when you start to use builtin codecs which have external error handler implementation as .py files, e.g. unicode('utf-8', 'replace') could then fail because of an ImportError. For the charmap codec it's mostly about performance. I don't have objections for other codecs which rely on external resources.
I wouldn't call it an issue. It's a feature :-) (and one that makes Python's exception mechanism very powerful) -- Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
What kinds of things would start failing? If you get an interactive prompt (i.e. Python still manages to start up), or you get a traceback indicating the problem in non-interactive mode, I don't see this as a problem - *of course* Python will stop working if you remove essential files. This is like saying you expect the interpreter to continue to work after you remove python23.dll. So, if your worry is that things would not work if you remove a Python file - don't worry. Python already relies on Python files being present in various places.
Please remember that we are still about error handling here, and that the normal case will be "strict", which usually results in aborting the computation. So I don't see the performance issue even for the charmap codec. Regards, Martin

Martin v. Loewis wrote:
Yes, but I think this is not that much of a problem, because when the code that catches the exception wants to do something with exc.args it has to know what the entries mean, which depends on the type. And if this code knows that it is dealing with a UnicodeEncodeError it can simply use exc.start instead of exc.args[2]. Bye, Walter Dörwald
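The point about named attributes versus positions in args can be made concrete with a short sketch in Python 3 syntax. It relies only on the documented attributes of UnicodeEncodeError (encoding, object, start, end, reason); the variable name caught is illustrative.

```python
caught = None
try:
    "caf\xe9".encode("ascii")
except UnicodeEncodeError as exc:
    caught = exc

# Code that knows the exception type can use the named attribute ...
by_name = caught.start
# ... instead of remembering which slot of args holds the same value.
by_index = caught.args[2]

fields = (caught.encoding, caught.object, caught.start, caught.end, caught.reason)
print(by_name, by_index, fields == caught.args)
```

The named form stays readable and does not depend on the order of the constructor arguments.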

Martin v. Loewis wrote:
Of course it's irrelevant how fast the exception is raised, but it could be important for the handlers that really do a replacement.
See the attached test script. It encodes 100 versions of the German text on http://www.gutenberg2000.de/grimm/maerchen/tapfere.htm

Output is as follows:

    1790000 chars, 2.330% unenc
    ignore: 0.022 (factor=1.000)
    xmlcharrefreplace: 0.044 (factor=1.962)
    xml2: 0.267 (factor=12.003)
    xml3: 0.723 (factor=32.506)
    workaround: 5.151 (factor=231.702)

i.e. a 1.7MB string with 2.3% unencodable characters was encoded. Using the inline xmlcharrefreplace instead of ignore is half as fast. Using a callback instead of the inline implementation is a factor of 12 slower than ignore. Using the Python implementation of the callback is a factor of 32 slower and using the pre-PEP workaround is a factor of 231 slower.

Replacing every unencodable character with u"\u4242" and using "iso-8859-15" gives:

    ignore: 0.351 (factor=1.000)
    xmlcharrefreplace: 0.390 (factor=1.113)
    xml2: 0.653 (factor=1.862)
    xml3: 1.137 (factor=3.244)
    workaround: 12.310 (factor=35.117)
No it's more like anticipating change.
It's just unintuitive.
Even simpler would be to move the initialization of the module variables from Modules/_codecsmodule.c to Lib/codecs.py. There is no need for them to be available in _codecs. All that is required for this change is to add

    strict_errors = lookup_error("strict")
    ignore_errors = lookup_error("ignore")
    replace_errors = lookup_error("replace")
    xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
    backslashreplace_errors = lookup_error("backslashreplace")

to codecs.py.

The registry should be available via two simple C APIs, just like the encoding registry.
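Whether the registry entry and the module attribute are the same object can be checked directly. The sketch below is written for a current CPython 3, where the module-level names are expected to be bound via lookup_error in the way proposed here; treat that as an assumption about the interpreter in use rather than a guarantee.

```python
import codecs

names = ["strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace"]

# Compare the object stored in the registry with the module attribute
# of the same name, e.g. codecs.replace_errors.
same = {
    name: codecs.lookup_error(name) is getattr(codecs, name + "_errors")
    for name in names
}
print(same)
```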
The codecs in the PEP *do* modify attributes of the exception object.
If somebody complains, we can fix it; until then, it suffices if this is documented.
It can't really be fixed for codecs implemented in Python. For codecs that use the C functions we could add the functionality that e.g. PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3], but AFAICT it can't be done easily for Python where attribute assignment directly goes to the instance dict. If those exception classes were new style classes it would be simple, because the attributes would be properties and args would probably be generated lazily.
It does:

    >>> "+xxx".decode("utf-7", "replace")
    u'\uc71c\ufffd'

although the result should probably have been u'\ufffd'.

Bye, Walter Dörwald

    import codecs, time

    def xml3(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"".join([ u"&#%d;" % ord(c) for c in exc.object[exc.start:exc.end]]), exc.end)
        else:
            raise TypeError("don't know how to handle %r" % exc)

    count = 0

    def check(exc):
        global count
        count += exc.end-exc.start
        return (u"", exc.end)

    codecs.register_error("xmlcheck", check)
    codecs.register_error("xml2", codecs.xmlcharrefreplace_errors)
    codecs.register_error("xml3", xml3)

    l = 100
    s = unicode(open("tapferschneider.txt").read(), "latin-1")
    s *= l

    s.encode("ascii", "xmlcheck")
    print "%d chars, %.03f%% unenc" % (len(s), 100.*(float(count)/len(s)))

    handlers = ["ignore", "xmlcharrefreplace", "xml2", "xml3"]
    times = [0]*(len(handlers)+1)
    res = [0]*(len(handlers)+1)
    for (i, h) in enumerate(handlers):
        t1 = time.time()
        res[i] = s.encode("ascii", h)
        t2 = time.time()
        times[i] = t2-t1
        print "%s: %.03f (factor=%.03f)" % (handlers[i], times[i], times[i]/times[0])

    i = len(handlers)
    t1 = time.time()
    v = []
    for c in s:
        try:
            v.append(c.encode("ascii"))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    res[i] = "".join(v)
    t2 = time.time()
    times[i] = t2-t1
    print "workaround: %.03f (factor=%.03f)" % (times[i], times[i]/times[0])
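A reduced variant of this comparison can be run on a current Python 3. This is only a sketch: the handler name "demo-xml3" and the synthetic text are stand-ins (the original input file is not available here), and the timings it prints are not comparable to the 2002 figures quoted in the thread.

```python
import codecs
import time

def xml3(exc):
    """Python-level counterpart of the builtin xmlcharrefreplace handler."""
    if isinstance(exc, UnicodeEncodeError):
        chunk = exc.object[exc.start:exc.end]
        return ("".join("&#%d;" % ord(c) for c in chunk), exc.end)
    raise TypeError("don't know how to handle %r" % exc)

codecs.register_error("demo-xml3", xml3)

# Synthetic stand-in for the test text: mostly ASCII with a few umlauts.
text = "Das tapfere Schneiderlein sa\xdf am Fenster und n\xe4hte. " * 2000

def timed(handler):
    """Encode the text with the given handler; return (bytes, seconds)."""
    t1 = time.perf_counter()
    out = text.encode("ascii", handler)
    return out, time.perf_counter() - t1

builtin_out, builtin_time = timed("xmlcharrefreplace")
python_out, python_time = timed("demo-xml3")

print("xmlcharrefreplace: %.4f" % builtin_time)
print("demo-xml3:         %.4f" % python_time)
```

Both handlers should produce identical bytes; only the cost of getting there differs.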

Walter Dörwald <walter@livinglogic.de> writes:
Those numbers are impressive. Can you please add

    def xml4(exc):
        if isinstance(exc, UnicodeEncodeError):
            if exc.end-exc.start == 1:
                return u"&#"+str(ord(exc.object[exc.start]))+u";"
            else:
                r = []
                for c in exc.object[exc.start:exc.end]:
                    r.extend([u"&#", str(ord(c)), u";"])
                return u"".join(r)
        else:
            raise TypeError("don't know how to handle %r" % exc)

and report how that performs (assuming I made no error)?
Using a callback instead of the inline implementation is a factor of 12 slower than ignore.
For the purpose of comparing C and Python, this isn't relevant, is it? Only the C version of xmlcharrefreplace and a Python version should be compared.
You could add methods such as set_reason to the class, which error handler authors would have to use. Again, these methods could be added through Python code, so no C code would be necessary to implement them. You could even implement a setattr method in Python - although you'd have to search for it from C while initializing the class. Regards, Martin

Martin v. Loewis wrote:
You must return a tuple (replacement, new input position); otherwise the code is correct. I tried it and two new versions:

    def xml5(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
        else:
            raise TypeError("don't know how to handle %r" % exc)

    def xml6(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (u"&#" + str(ord(exc.object[exc.start])) + u";", exc.start+1)
        else:
            raise TypeError("don't know how to handle %r" % exc)

Here are the results:

    1790000 chars, 2.330% unenc
    ignore: 0.022 (factor=1.000)
    xmlcharrefreplace: 0.042 (factor=1.935)
    xml2: 0.264 (factor=12.084)
    xml3: 0.733 (factor=33.529)
    xml4: 0.504 (factor=23.057)
    xml5: 0.474 (factor=21.649)
    xml6: 0.481 (factor=22.010)
    workaround: 5.138 (factor=234.862)
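The return-value requirement mentioned at the top of this message can be demonstrated as follows. This is a sketch in Python 3 syntax; the handler names "demo-bad" and "demo-good" are invented for the example, and the first handler deliberately repeats the mistake of returning only a string.

```python
import codecs

def returns_string_only(exc):
    # Mistake under illustration: the new input position is missing.
    return "&#%d;" % ord(exc.object[exc.start])

def returns_tuple(exc):
    # Correct form: (replacement, position at which to resume encoding).
    return ("&#%d;" % ord(exc.object[exc.start]), exc.start + 1)

codecs.register_error("demo-bad", returns_string_only)
codecs.register_error("demo-good", returns_tuple)

try:
    "\xe4".encode("ascii", "demo-bad")
    bad_outcome = "no error"
except TypeError:
    bad_outcome = "TypeError"

good = "\xe4\xf6".encode("ascii", "demo-good")
print(bad_outcome, good)
```

Since returns_tuple resumes at exc.start + 1, it is called once per character even when the codec reports a longer run, which is the behaviour xml5 and xml6 trade against the single-call approach of xml3.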
I was just too lazy to code this. ;) Python is a factor of 2.7 slower than the C callback (or 1.9 for your version).
For me this sounds much more complicated than the current C functions, especially for using them from C, which most codecs probably will. Bye, Walter Dörwald

Having to register the error handler first and then finding it by name smells like a very big hack to me. I understand the reasoning (that you don't want to modify the API of a gazillion C routines to add an error object argument) but it still seems like a hack.... -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -

Jack Jansen wrote:
Well, in that case, you would have to call the whole codec registry a hack ;-) I find having the callback available by an alias name very user friendly, but YMMV. The main reason behind this way of doing it is to maintain C API compatibility without adding a completely new b/w compatibility layer (Walter started out this way; see the SF patch page). -- Marc-Andre Lemburg
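The register-then-look-up-by-name flow under discussion looks like this from the Python side. A minimal sketch in Python 3 syntax; "demo-underscore" and "demo-never-registered" are made-up names, while codecs.register_error and codecs.lookup_error are the functions the PEP introduces.

```python
import codecs

def underscore(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        return ("_" * (exc.end - exc.start), exc.end)
    raise TypeError("don't know how to handle %r" % exc)

# Step 1: register the callable under an alias name.
codecs.register_error("demo-underscore", underscore)

# Step 2: any caller can now refer to it by that name only.
found = codecs.lookup_error("demo-underscore")
result = "na\xefve".encode("ascii", "demo-underscore")

# Names that were never registered raise LookupError.
try:
    codecs.lookup_error("demo-never-registered")
    missing = "found"
except LookupError:
    missing = "LookupError"

print(result, missing)
```

The existing encode() and decode() signatures stay unchanged because the errors argument remains a plain string.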

On Monday, August 5, 2002, at 05:47, M.-A. Lemburg wrote:
No, not really. For codecs I think that there needn't be much of a connection between the codec-supplier and the codec-user. Conceivably the encoding-identifying string being passed to encode() could even have been read from a data file or something. For error handling this is silly: the code calling encode() or decode() will know how it wants errors handled. And if you argue that it isn't really error handling but an extension to the encoding name then maybe it should be treated as such (by appending it to the codec name in the string, as in "ascii;xmlentitydefs" or so?). -- Jack Jansen

Jack Jansen wrote:
You are omitting the fact, though, that different codecs may need different implementations of a specific error handler. Now the error handler will always implement the same logic, so to the users it's all the same thing. And by using the string alias he needn't worry about where to get the error handler from (it typically lives with the codec itself). Note that error handling is not really an extension to the encoding itself. It just happens that it can be put to use that way for e.g. escaping non-representable characters. Other applications like fetching extra information from external sources or logging the positions of coding problems do not fall into this category. -- Marc-Andre Lemburg
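The logging use case mentioned above might look like the following sketch (Python 3 syntax). The handler name "demo-log-replace" and the problems list are invented for illustration; the handler records where the problem occurred and then delegates to the builtin "replace" handler, so it does not define any replacement policy of its own.

```python
import codecs

problems = []  # (start, end, reason) for every coding problem seen

def log_and_replace(exc):
    """Record the position of the problem, then defer to 'replace'."""
    problems.append((exc.start, exc.end, exc.reason))
    return codecs.replace_errors(exc)

codecs.register_error("demo-log-replace", log_and_replace)

# 0xff can never start a valid UTF-8 sequence.
decoded = b"abc\xffdef".decode("utf-8", "demo-log-replace")
print(repr(decoded), problems)
```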
Participants (7):
- Guido van Rossum
- Jack Jansen
- Jack Jansen
- M.-A. Lemburg
- martin@v.loewis.de
- Oren Tirosh
- Walter Dörwald