different encodings for unicode() and u''.encode(), bug?

Thu Jan 3 18:02:46 EST 2008

On Jan 4, 8:03 am, mario <ma... at ruggier.org> wrote:
> On Jan 2, 2:25 pm, Piet van Oostrum <p... at cs.uu.nl> wrote:
>
> > Apparently for the empty string the encoding is irrelevant as it will not
> > be used. I guess there is an early check for this special case in the code.
>
> In the module I am working on [*] I am remembering a failed encoding
> to allow me, if necessary, to later re-process fewer encodings.

If you were in fact doing that, you would not have had a problem. What
you appear to have been doing is (a) remembering a NON-failing
encoding, and assuming that it would continue not to fail, and (b) not
differentiating between failure reasons (codec doesn't exist, input
not consistent with specified encoding).
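
A minimal sketch of keeping those two failure reasons apart. The helper
name try_decode and its status strings are made up for illustration, and
strg is assumed to be a byte string (a plain str in Python 2). The explicit
codecs.lookup() call is there so that the codec is checked even when strg
is empty:

```python
import codecs


def try_decode(strg, encoding):
    """Return (status, text); status names the reason for any failure."""
    try:
        codecs.lookup(encoding)  # raises LookupError if there is no such codec
        return ("ok", strg.decode(encoding))
    except LookupError:
        # The codec doesn't exist (or can't be used to decode text).
        return ("unknown-codec", None)
    except UnicodeError:
        # The codec exists, but the input is not consistent with it.
        return ("bad-input", None)
```

With this, only an "unknown-codec" result is a reason to drop an encoding
for good; "bad-input" says something about one particular input only.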

A good strategy when dealing with encodings that are unknown (in the
sense that they come from user input, or a list of encodings you got
out of the manual, or are constructed on the fly (e.g. encoding = 'cp'
+ str(code_page_number) # old MS Excel files)) is to try to decode
some vanilla ASCII alphabetic text, so that you can give an immediate
in-context error message.
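
Something along these lines would do, as a sketch only: check_encoding,
its probe text, its error messages, and the code page value are all
invented for the example.

```python
def check_encoding(encoding, probe=b"abcdefghijklmnopqrstuvwxyz"):
    """Raise ValueError with an in-context message if encoding is unusable."""
    try:
        probe.decode(encoding)
    except LookupError:
        raise ValueError("unknown encoding: %r" % (encoding,))
    except UnicodeError as e:
        raise ValueError("encoding %r can't decode plain ASCII text: %s"
                         % (encoding, e))


code_page_number = 1252  # hypothetical value, e.g. read from a file header
check_encoding("cp" + str(code_page_number))  # passes silently if usable
```

The probe is deliberately non-empty, so the check does not depend on how
the empty string is treated.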

> In the
> case of an empty string AND an unknown encoding this strategy
> failed...

>
> Anyhow, the question is, should the behaviour be the same for these
> operations, and if so what should it be:
>
> u"".encode("non-existent")
> unicode("", "non-existent")

Perhaps you should make TWO comparisons:
(1)
    unistrg = strg.decode(encoding)
with
    unistrg = unicode(strg, encoding)
[the latter "optimises" the case where strg is ''; the former can't
because its output may be '', not u'', depending on the encoding, so
it must do the lookup]
(2)
    unistrg = strg.decode(encoding)
with
    strg = unistrg.encode(encoding)
[both always do the lookup]

In any case, a pointless question (IMHO); the behaviour is extremely
unlikely to change, as the chance of breaking existing code outvotes
any desire to clean up a minor inconsistency that is easily worked
around.