[I18n-sig] Normal and unicode strings

Thu, 3 Jan 2002 23:20:12 +0100

> I've started to look at Unicode..
> 
> There're two types of strings in Python, 'str' and 'unicode'.
> I guess there're technical reasons to have two different
> classes. Please, could somebody explain me these reasons?

Strings, traditionally, have been used for two things:

- byte strings, as you get them when reading from a file or a network
  connection, or interacting with the operating system in a variety of
  other ways, and

- character strings, to represent text - typically intended for the
  eventual display to the user using glyphs in some font.

Notice that both uses of strings are equally important. If you
disagree, just consider how you would do things like bitmaps (GIF
files, JPEG files, video streams) or networking protocols (like HTTP
or NFS) without byte strings.

It turns out that there is no meaningful way to support both uses
with a single type. To support bytes properly (including the C API),
you really need the property that each element takes one of 256
values, and that the elements form a contiguous block in your
computer's memory. To support character strings properly, you need
much more than 256 values. Unicode is an international standard that
associates well-defined meanings with more than 100,000 of these
values, so that the characters of all languages can be represented
in a single character set.
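
As a small illustration of the difference (a sketch only; the variable
names are made up for this example, and UTF-8 is just the encoding
picked for it), a character such as the euro sign has a code well
above 255 and takes several bytes once it is encoded:

```python
# A character string with two characters: 'a' and the euro sign.
text = u'a\u20ac'

# Encoding it yields a byte string; UTF-8 is an assumption made here,
# other encodings would give different bytes.
data = text.encode('utf-8')

char_count = len(text)    # counts characters
byte_count = len(data)    # counts bytes
euro_code = ord(text[1])  # 0x20AC, outside the 0..255 range of a byte

# Decoding with the same encoding gives the character string back.
round_trip = data.decode('utf-8')
```

The character count and the byte count differ, which is why a single
element type cannot serve both uses.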

> Please, keep in mind that I've never looked at the Python sources
> and I'm still quite ignorant about Unicode.

If you really want to get familiar with Unicode, the Python
documentation alone is the wrong place. Please refer to
www.unicode.org; they recommend buying their book, but they also
have a lot of introductory material.

> I think that for the user (the Python programmer) it would
> be better to have only one class of strings, if possible of
> course. 

No. The user should always be aware whether he has a byte string or
a character string. For byte strings, the type named 'str' should be
used; for character strings, the type named 'unicode' is good.
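
A minimal sketch of what staying aware can look like in practice,
assuming the data is UTF-8 encoded; to_text and to_bytes are
hypothetical helper names, not library functions. Byte strings are
converted to character strings where data enters the program, and
back to byte strings where it leaves:

```python
def to_text(raw, encoding='utf-8'):
    """Decode a byte string (e.g. read from a socket) into characters."""
    return raw.decode(encoding)


def to_bytes(text, encoding='utf-8'):
    """Encode a character string into bytes for a file or the network."""
    return text.encode(encoding)


# A character string containing two non-ASCII characters.
greeting = u'Gr\xfc\xdfe'

# What would actually be written to a file or sent over a connection.
wire = to_bytes(greeting)

# What the receiving side recovers, provided it knows the encoding.
received = to_text(wire)
```

The encoding has to be known at both ends; nothing in the bytes
themselves says which one was used.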

> Is there any chance that this will be addressed in future versions
> of Python?

Perhaps, but it is unclear how this could work. Most likely, string
literals would mean "character string", but then people who want to
have byte string literals will complain - even the standard library
uses both byte string literals and character string literals, without
distinguishing between them.

There is a patch on SF proposing a migration strategy: First introduce
the notion of byte string literals (b'HTTP/1.0'), then, years later,
consider changing the meaning of plain strings to mean Unicode.
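
To give an idea of how code might read with such literals, here is a
sketch that assumes an interpreter accepting the b'...' notation; the
request line and the variable names are invented for the example:

```python
# Assumption: the interpreter accepts b'...' byte string literals.
request_line = b'GET /index.html HTTP/1.0\r\n'

# Protocol handling stays entirely in bytes.
method, path, version = request_line.split()
is_http10 = (version == b'HTTP/1.0')

# Convert to a character string only when the value is meant for display.
version_text = version.decode('ascii')
```

The point of the literal is that the protocol constant is visibly a
byte string, independent of what plain string literals mean.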

Regards,
Martin