
On Wednesday, 25 May 2011 at 23:41 +0200, Laura Creighton wrote:
One reason I didn't implement the classes yet is that I couldn't understand two points in how they are supposed to work. But it seems that these are really two bugs, as was pointed out to me: http://bugs.python.org/issue12100 and http://bugs.python.org/issue12171 . So the question is whether we should be bug-compatible with Python 2.7 or instead implement some fixed version.
I fixed #12100 in Python 2.7, 3.1, 3.2 and 3.3 yesterday. I also plan to fix #12171 in these four versions; it should be done in the next few days.
I suppose I'm rather for the fixed version, but I'd like to hear some feedback from people that actually use multibytecodecs.
Both bugs are related to encoders. I don't think that anyone is using the Python CJK codecs to encode text (because nobody noticed these bugs before); decoding text is the more likely use. Anyway, you should implement a codec without these *bugs*. For your information, I added more tests to the CJK codecs (e.g. see #12057), and I plan to add more in the coming weeks. I also plan to fix issue #12016, yet another CJK codec bug. You may want to wait until all of these bugs are fixed before working on your own implementation, or directly implement a version without any of these bugs and then upgrade the test suite.
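One consistency check that a reimplementation of the encoder classes could reuse is sketched below: for a stateful codec, feeding text to an incremental encoder chunk by chunk should produce the same bytes as a one-shot encode. This is only an assumption about what "fixed" behaviour looks like on a current CPython, not a description of either issue; the helper name `encode_in_chunks` and the sample text are made up for the illustration.

```python
import codecs


def encode_in_chunks(text, encoding, chunk_size=1):
    """Encode text piece by piece with an incremental encoder (hypothetical helper)."""
    encoder = codecs.getincrementalencoder(encoding)()
    parts = []
    for start in range(0, len(text), chunk_size):
        # final=False: the encoder is expected to keep its shift state between calls
        parts.append(encoder.encode(text[start:start + chunk_size]))
    # An empty final call lets a stateful codec flush its state
    parts.append(encoder.encode("", final=True))
    return b"".join(parts)


text = "\u3042\u3044"  # two hiragana characters, chosen arbitrarily
one_shot = text.encode("iso2022_jp")
chunked = encode_in_chunks(text, "iso2022_jp")

print("same bytes as one-shot encode:", chunked == one_shot)
print("round-trips through the decoder:", chunked.decode("iso2022_jp") == text)
```

ISO-2022-JP is used here because it switches character sets with escape sequences, so an encoder that loses its state between calls would show up as extra or missing escapes in the chunked output.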
Also, I wouldn't mind if someone would pick up the work and just do it, either the classes or ``errors != "strict"'' :-)
Support for error handlers other than strict is far from perfect. Issue #12016 is the main problem, but there are other minor issues. In some cases, invalid byte sequences are ignored even with the replace error handler (whereas I expected U+FFFD characters). CJK codecs are special because they use escape sequences (especially the ISO 2022 family): what should be done if a byte sequence looks like an escape sequence but is not valid? Replace each byte with U+FFFD, or ignore these bytes? I'm trying to write tests "describing" the current behaviour, and then I will maybe try to improve how invalid byte sequences are handled.

Victor
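To see what the three standard error handlers do with one invalid byte, a small script such as the following can be run against any implementation. It is a sketch only: the input is made up, and it assumes that 0xFF is not a valid byte in Shift_JIS. It does not cover the ISO 2022 escape-sequence cases discussed above, where the behaviour may differ.

```python
# Made-up input: ASCII text with one byte (0xFF) assumed invalid in Shift_JIS
data = b"abc\xffdef"


def decode_with(handler):
    """Return the decoded text, or the exception raised by the handler."""
    try:
        return data.decode("shift_jis", errors=handler)
    except UnicodeDecodeError as exc:
        return exc


results = {handler: decode_with(handler) for handler in ("strict", "replace", "ignore")}

for handler, result in results.items():
    print(handler, "->", repr(result))
```

Recording the printed results in a test is one way to "describe" current behaviour before deciding whether it should change.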