M.-A. Lemburg wrote:
Walter Dörwald wrote:
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I like the idea, but I don't think the implementation is the right way to do it. Instead, I'd suggest using a new error handling strategy "break" (= stop processing as soon as errors are found).
Can you demonstrate this approach in a patch? I think it is unimplementable: the codec cannot communicate to the error callback that it ran out of data.
We would need a special attribute in the exception for that, but the problem IMHO is a different one. This makes it impossible to use other error handling schemes than "break" in stateful decoders.
I don't understand... are you referring to some extra attribute for storing arbitrary state?
The position of the error is not sufficient to determine whether it is a truncated data error or a real one: both r"a\xf".decode("unicode-escape") and r"a\xfx".decode("unicode-escape") raise a UnicodeDecodeError with exc.end == len(exc.object), i.e. the error is at the end of the input. In the first case the error will go away once more data is available; in the second case it won't.
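To make the ambiguity concrete, here is a minimal sketch written for a current Python 3 rather than the interpreter discussed in this thread. It uses the UTF-8 codec instead of unicode-escape, because the exact error positions reported by unicode-escape may differ between versions. The helper name error_span is made up for this illustration.

```python
def error_span(data, encoding="utf-8"):
    """Return (start, end, len(object)) for a failed strict decode, else None.

    error_span is a hypothetical helper for this illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end, len(exc.object)
    return None


# First two bytes of the three-byte sequence for U+20AC: a transient error,
# it disappears once the missing byte b"\xac" arrives.
truncated = error_span(b"a\xe2\x82")

# 0xff can never start a UTF-8 sequence: a real error, more data will not help.
invalid = error_span(b"a\xff")

# In both cases the reported end of the error equals the length of the input,
# so the position alone does not tell the two situations apart.
print(truncated, invalid)
```

An error callback that only looks at exc.start and exc.end sees "error runs to the end of the input" in both cases, which is why an additional attribute (or a different mechanism) would be needed.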
If so, why would adding such an attribute make it impossible to use other error handling schemes?
It doesn't, but it would make it possible for the callback to distinguish transient errors from real ones.
The problem with your patch is that you are adding a whole new set of decoders to the core which duplicate much of what the already existing decoders implement. I don't like that duplication and would like to find a way to only have *one* implementation per decode operation.
I don't like the duplication either. In fact we might need decoders that pass state, but do complain about truncated data at the end of the stream. I think it's possible to find other solutions. I would prefer stateful decoders implemented in C.
But I hope you agree that this is a problem that should be fixed.
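As a rough sketch of the behaviour described above (a decoder that keeps state between calls but still complains about truncated data at the end of the stream), the following uses codecs.getincrementaldecoder from current Python versions. That API is not part of the patch under discussion here. It is shown only to illustrate the desired semantics, and the function name truncated_at_end_is_reported is made up for this example.

```python
import codecs

# A stateful UTF-8 decoder with the default "strict" error handling.
decoder = codecs.getincrementaldecoder("utf-8")()

# The three bytes of U+20AC arrive split across two chunks. The incomplete
# tail of the first chunk is kept as state instead of raising an error.
first = decoder.decode(b"a\xe2\x82")
second = decoder.decode(b"\xac")


def truncated_at_end_is_reported():
    """Return True if leftover bytes at end of stream raise an error."""
    d = codecs.getincrementaldecoder("utf-8")()
    d.decode(b"\xe2\x82")          # buffered, no error yet
    try:
        d.decode(b"", final=True)  # end of stream: truncation is now real
    except UnicodeDecodeError:
        return True
    return False
```

Here the caller, not the error callback, tells the decoder whether more data may follow, so any error handling scheme ("strict", "replace", "ignore", ...) can still be combined with stateful decoding.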
Of course, encoders would have to provide the same interfaces for symmetry reasons.
There are no encoders that have to keep state, except for UTF-16.
Bye, Walter Dörwald