[I18n-sig] Literal strings

M.-A. Lemburg mal@lemburg.com
Fri, 02 Jun 2000 11:32:29 +0200

Paul Prescod wrote:
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
> I have a lot of code like this:
> if tagName=="foo":
>         ...
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.
> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals. I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.
> I just want to add something at the top of my file like:
> #pragma IL8N
> and have my literal strings act as Unicode.
> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but
>  a) that's really ugly, syntactically
>  b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default
>  c) I feel like I'm being penalized for making my program
> internationalized
>  d) I have a lot of code, as we all do.

You can use the experimental command line flag -U to have the
Python compiler do this for you. The downside is that it does
this for *all* modules and this currently causes much of the
standard lib to fail (that's why it's experimental -- a future
goal should be making the standard lib work with and without -U).

The safest way to do this certainly is by fixing all
instances to use u"" instead of "" (not that hard, really).
Even though this may look strange at first, reading the code
will immediately bring your attention to the fact that you
are dealing with Unicode here -- a #pragma at the top won't
get that much attention and a casual user might wonder
where the u"" strings in variable dumps originate from.
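As a rough illustration of the explicit-literal approach (a sketch
only: the function name is_foo is made up for this example, and the
code is written so that it also runs on a current Python 3, where
u"" is still accepted and every string literal is Unicode anyway):

```python
# Hypothetical example: compare a tag name against an explicit
# Unicode literal, so the comparison is well defined for any input.
def is_foo(tag_name):
    return tag_name == u"foo"

# Plain ASCII input matches the Unicode literal.
matches_ascii = is_foo(u"foo")

# Input containing a cedilla (U+00E7) simply compares unequal
# instead of raising an error.
matches_cedilla = is_foo(u"fo\u00e7")
```

The u prefix makes it visible at the point of comparison that the
code is dealing with Unicode.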

Note that there are plans to add a #pragma to allow
specifying a Python script encoding. Things haven't been
sorted out, though.

One way to do this is by turning all "" string literals into
u"", assuming the encoding given in the #pragma, e.g. Latin-1
or MacRoman -- this would be along the lines of what you have
in mind. The problem with this is that some string literals
might have to map to 8-bit strings, so for these you'd need
to write e.g. s"" or something similar.
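To show why the declared encoding matters, here is a small sketch
of the idea. It is not a proposed implementation: the helper name
literal_to_unicode is hypothetical, and the code targets a current
Python 3 rather than the interpreter discussed here. It decodes the
same raw literal bytes under two different source encodings:

```python
# Hypothetical helper: turn the raw bytes of a "" literal into a
# Unicode string, using the encoding declared for the source file.
def literal_to_unicode(raw_bytes, source_encoding):
    return raw_bytes.decode(source_encoding)

# The byte 0xE7 is a c-cedilla in Latin-1 but a different
# character in MacRoman.
raw = b"gar\xe7on"

as_latin1 = literal_to_unicode(raw, "latin-1")
as_macroman = literal_to_unicode(raw, "mac_roman")

# The two decodings differ, so the literal's meaning depends on
# the declared encoding.
same_text = (as_latin1 == as_macroman)

# A literal meant as 8-bit data would have to stay undecoded.
binary_literal = raw
```

The last line corresponds to the case where something like s""
would be needed: the bytes are kept as they are rather than being
interpreted as text.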

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/