[Python-Dev] PEP 383 update: utf8b is now the error handler

Michael Urman murman at gmail.com
Thu May 7 16:18:31 CEST 2009


On Thu, May 7, 2009 at 00:43, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Michael Urman wrote:
>> On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>> Despite there being also an error handler called "surrogates".
>>
>> Not that I have to be, but I'm not sold on the previous UTF-8 codec
>> behavior becoming an error handler of the name "surrogates" for two
>> reasons (I do respect the obvious PBP argument for the implementation,
>> and have no better name - "lenient"?).
>
> PBP?

Practicality beats purity. From a purity standpoint, the legacy
invalid UTF-8 seems more like an encoding than an error handler to me.
From a practicality standpoint, it's presumably much more convenient
to implement it on top of the new valid UTF-8 codec's behavior. And
then any error handler needs a name.

> Well, there is a way to stack error handlers, although it's not pretty:
> [...]
> codecs.register_error("surrogates_then_replace",
>                      surrogates_then_replace)

That mitigates my arguments significantly, although I'd rather see
something like errors=('surrogates', 'replace') chain the handlers
without additional registrations. But that's a different PEP or
arbitrary change. :)
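
For what it's worth, the chaining I have in mind can be roughly
approximated today with a small helper around codecs.lookup_error and
codecs.register_error. This is only a sketch: chain_error_handlers is
a hypothetical name, not an existing API, and the handler names
"surrogateescape" and "replace" are the ones found in current Python 3
(surrogateescape being the name the PEP 383 handler ended up with),
not the "utf8b"/"surrogates" names under discussion here.

```python
import codecs


def chain_error_handlers(name, *handler_names):
    """Register `name` as a handler that tries each named handler in order.

    Hypothetical helper: the first handler that returns a replacement
    wins; if every handler raises, the last error is re-raised.
    """
    handlers = [codecs.lookup_error(n) for n in handler_names]

    def chained(exc):
        last_error = exc
        for handler in handlers:
            try:
                return handler(exc)
            except UnicodeError as err:
                # This handler could not cope; fall through to the next.
                last_error = err
        raise last_error

    codecs.register_error(name, chained)
    return name


# Assumed handler names from current Python 3, see the note above.
name = chain_error_handlers(
    "surrogateescape_then_replace", "surrogateescape", "replace"
)

# The escaped byte round-trips; the euro sign falls back to "?".
encoded = "a\udc80b\u20acc".encode("ascii", errors=name)
```

It still needs one registration per combination, which is exactly
what a tuple-valued errors argument would avoid.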

>> The stacking argument also applies to the new utf8b behavior on encode
>> (only, as it handles all errors on decode). This may be a YAGNI
>
> Indeed - in particular, as, in the primary application of this error
> handler (i.e. file IO operations), there is no way of specifying
> an additional error handler anyway.

Would it be useful to allow setting this somewhere? It'd be analogous
to setfsencoding, perhaps a setfsencodingerrors. It's not hard to
imagine an application working on Windows where all Unicode characters
are valid, and constructing backup filenames by adding some arbitrary
character, or receiving them from a user who doesn't understand
encodings. When this application is taken to a non-Unicode filesystem,
without the ability to say "I really want a valid filename: so
replace", that could get messy. But it may still be a YAGNI, or a
"don't do that."
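
To make the scenario concrete, here is a minimal sketch of what I
mean. encode_filename and its fs_errors parameter are hypothetical
stand-ins for whatever the interpreter does internally with a
settable error handler; "ascii" stands in for a non-Unicode
filesystem encoding, and "surrogateescape" is the name the handler
carries in current Python 3.

```python
def encode_filename(name, fs_encoding="ascii", fs_errors="surrogateescape"):
    """Hypothetical stand-in for the interpreter's filename encoding step."""
    return name.encode(fs_encoding, fs_errors)


# A backup name built by appending an arbitrary (non-ASCII) character.
backup_name = "report.txt\u2020"

# With only the escaping handler, the dagger is not an escaped byte,
# so there is nothing to fall back on and encoding fails.
try:
    encode_filename(backup_name)
    failure = None
except UnicodeEncodeError as err:
    failure = err

# With a settable handler, the application can ask for a valid name.
replaced = encode_filename(backup_name, fs_errors="replace")
```

Here the default raises UnicodeEncodeError, while "replace" yields a
usable (if lossy) filename.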

-- 
Michael Urman

