re.sub() bug?

Sat Nov 9 11:25:59 EST 2002

Mike Brown wrote:

> Python 2.2.1 on FreeBSD.
> These work as expected:
>
> >>> re.sub(u'f', u'b', u'foo')  # keep string as Unicode
> u'boo'
> >>> re.sub(u'f', u'b', 'foo')   # coerce string to Unicode
> u'boo'
>
> But this doesn't work the way I'd think it would:
>
> >>> re.sub(u'f', u'b', u'')     # coerce string to non-Unicode?!
> ''
>
> So, is this a bug?

It's a buglet, sure.

But if you write code that depends on this difference, your code is
a lot more fragile (and less future-proof) than it should be.

Python's Unicode system allows you to mix Unicode strings with
standard strings, as long as the latter contain only ASCII characters.
Good practice is to make sure your code is as tolerant as Python.

(Or, to put it another way: write code that does the right thing if
an operation returns a Unicode string instead of the corresponding
ASCII string, and likewise if a function that usually returns a
Unicode string returns an ordinary string instead.)
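
One way to stay tolerant is to normalize results before relying on their
type. The following is only a minimal sketch, not something from the
original report: ensure_text and text_type are hypothetical names, and the
ASCII default is an assumption that matches the mixing rule described above.

```python
import re

# The Unicode text type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")

def ensure_text(value, encoding="ascii"):
    """Return `value` as a Unicode string (hypothetical helper).

    Unicode strings pass through unchanged; byte strings are decoded,
    which raises an error if they hold data outside `encoding`.
    """
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)

# The same code path handles both the usual and the empty-input case,
# whichever string type re.sub() happens to return.
for source in (u"foo", u""):
    result = ensure_text(re.sub(u"f", u"b", source))
    assert isinstance(result, text_type)
```

Code written this way does not care whether re.sub() hands back '' or u''
for the empty input.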

Standard strings containing non-ASCII data are a different thing;
they're encoded, and should be seen as binary buffers.
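
As an illustration of the "binary buffer" view, encoded data can be decoded
explicitly at the point where it enters the program. This is a sketch under
assumptions: the sample bytes and the ISO-8859-1 encoding are made up for
the example, and the b prefix needs Python 2.6 or later (on older versions,
drop it).

```python
# Hypothetical input: bytes known to be ISO-8859-1 (Latin-1) encoded.
raw = b"caf\xe9"

# Decode explicitly with the encoding the data was written in.
text = raw.decode("iso-8859-1")

# Treating the same bytes as ASCII fails, which is why such strings
# cannot be mixed with Unicode strings implicitly.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeError:
    ascii_ok = False
```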

</F>