Py 3.3, unicode / upper()

MRAB python at
Thu Dec 20 21:20:57 CET 2012

On 2012-12-20 19:19, wxjmfauth at wrote:
> Fact.
> In order to work comfortably and with efficiency with a "scheme for
> the coding of the characters", can be unicode or any coding scheme,
> one has to take into account two things: 1) work with a unique set
> of characters and 2) work with a contiguous block of code points.
> At this point, it should be noticed I did not even write about
> the real coding, only about characters and code points.
> Now, let's take a look at what happens when one breaks the rules
> above and, precisely, if one attempts to work with multiple
> character sets or if one divides - artificially - the whole range
> of the unicode code points in chunks.
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
> If you are an "ascii" user, a FSR model makes no sense. An
> "ascii" user will use, per definition, only "ascii characters".
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are per definition a "non-ascii" user of
> "non-ascii" characters. Any optimisation for "ascii" users just
> becomes irrelevant.
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.
> The rule is to treat every character of a unique set of characters
> of a coding scheme in, how to say, an "equal way". The problem
> can be seen the other way, every coding scheme has been built
> to work with a unique set of characters, otherwise it is not
> properly working!
It's true that in an ideal world you would treat all codepoints the
same. However, this is a case where "practicality beats purity".

In order to accommodate every codepoint you need 3 bytes per codepoint
(although for pragmatic reasons it's 4 bytes per codepoint).
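
As a rough check of that arithmetic, here is a minimal sketch. The helper
name bytes_needed is made up for illustration; sys.maxunicode and
int.bit_length() are standard. It assumes Python 3.3 or later, where
sys.maxunicode is always 0x10FFFF.

```python
import sys

def bytes_needed(code_point):
    """Smallest whole number of bytes that can hold code_point (hypothetical helper)."""
    bits = max(code_point.bit_length(), 1)  # treat U+0000 as needing 1 bit
    return (bits + 7) // 8                  # round up to whole bytes

highest = sys.maxunicode  # 0x10FFFF on Python 3.3 and later
print(hex(highest), highest.bit_length(), bytes_needed(highest))
```

The highest codepoint needs 21 bits, which rounds up to 3 bytes; 4 bytes is
simply the next size that is convenient for alignment and indexing.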

But not all codepoints are used equally. Those in the "astral planes"
(above U+FFFF), for example, are used rarely, so the vast majority of
the time you would be using twice as much memory as strictly necessary.
There are also, in reality, many strings that contain only ASCII-range
codepoints, although they may not be visible to the average user, being
the names of functions and attributes in program code, or tags and
attributes in HTML and XML.

FSR is a pragmatic solution to dealing with limited resources.
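
You can see the effect for yourself with something like the sketch below.
It assumes CPython 3.3 or later (other implementations may store strings
differently), and bytes_per_code_point is a name I made up for
illustration. Comparing two lengths of the same string cancels out the
fixed object header, leaving only the per-codepoint cost.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate per-codepoint storage for strings made only of ch.

    Hypothetical helper: the difference between the sizes of a 2n-long
    and an n-long string removes the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),        # e with acute accent
    ("BMP", "\u20ac"),          # euro sign
    ("astral", "\U0001F600"),   # a codepoint above U+FFFF
]
for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```

On CPython I would expect the widths to come out as 1, 1, 2 and 4 bytes
respectively, so only strings that actually contain astral codepoints pay
the full 4 bytes per codepoint.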

Would you prefer there to be a switch that makes strings always use 4
bytes per codepoint for those users and systems where memory is no
