>>A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
>Another way of putting this is:
>- utf-8 in an 8-bit string is to a unicode string what a pickle is to an
>- defaulting to utf-8 upon coercing is like implicitly trying to unpickle
>an 8-bit string when comparing it to an instance. Bad idea.
>Defaulting to Latin-1 is the only logical choice, no matter how
>western-culture-centric this may seem.
The Van Rossum Common Sense gene strikes again! You guys owe
it to the world to have lots of children.
I agree 100%. Let me also add that if you want to do encoding work
that goes beyond what the library gives you, you absolutely need
a 'byte array' type which makes no assumptions and does nothing
magic to its content. I have always thought of 8-bit strings as 'byte
arrays' and not 'character arrays', and doing anything magic to
them in literals or standard input is going to cause lots of trouble.
I think our proposal is BETTER than Java, Tcl, Visual Basic etc for
the following reasons:
- you can work with old fashioned strings, which are understood
by everyone to be arrays of bytes, and there is no magic
conversion going on. The bytes in literal strings in your script file
are the bytes that end up in the program.
- you can work with Unicode strings if you want
- you are in explicit control of conversions between them
- both types have similar methods so there isn't much to learn or
The 'no magic' thing is very important with Japanese, where very
often you need to roll your own codecs and look at the raw bytes;
any auto-conversion might not go through the filter you want and
you've already lost information before you started, especially if
your job is to repair possibly corrupt data. Any company with
a few extra custom characters in the user-defined Shift-JIS range
is going to suddenly find their Perl scripts are failing or trashing
all their data as a result of the UTF-8 decision.
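
A minimal sketch of the 'lost information' point, written for present-day
Python 3, where `bytes` plays the role of the 8-bit string (an assumption;
this thread predates that type). The two sample bytes are the Shift-JIS
encoding of a single hiragana character, chosen purely for illustration.

```python
# Two raw bytes: the Shift-JIS encoding of the hiragana "あ" (U+3042).
raw = b"\x82\xa0"

# Decoding with the codec the data was actually written in is lossless.
text = raw.decode("shift_jis")
round_trip = text.encode("shift_jis")

# An automatic UTF-8 conversion that papers over errors replaces both
# bytes with U+FFFD, so the original bytes cannot be recovered afterwards.
guessed = raw.decode("utf-8", errors="replace")
damaged = guessed.encode("utf-8")

# Strict decoding at least fails loudly instead of silently trashing data.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Keeping the data as raw bytes until you pick the codec yourself is what
leaves room for a hand-rolled filter or repair step.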
I'm also convinced that the majority of Python scripts won't need
to work in Unicode. Even working with exotic languages,
there is always a native 8-bit encoding. I have only used Unicode when
(a) working with data that is in several languages
(b) doing conversions, which requires a 'central point'
(c) wanting to do per-character operations safely on multi-byte data
I still haven't sorted out in my head whether the default encoding
thing is a big red herring or is important; I already have a safe way
to construct Unicode literals in my source files if I want to using
But if there has to be one I'd say the following:
- strict ASCII is an option
- Latin-1 is the more generous option that is right for the most
and has a 'special status' among 8-bit encodings
- UTF-8 is not one byte per character and will confuse people
Just my 2p worth,
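
The three candidate defaults listed above can be compared with a short
sketch in present-day Python 3 (an assumption; `bytes` stands in for the
8-bit string of the thread). It only illustrates the byte-per-character
property and is not a statement of what any Python release defaults to.

```python
# Every possible byte value, as an 8-bit string.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding never fails,
# the length is preserved, and encoding restores the exact bytes.
as_latin1 = all_bytes.decode("latin-1")
latin1_lossless = as_latin1.encode("latin-1") == all_bytes

# UTF-8 is not one byte per character: "é" (U+00E9) takes two bytes.
e_acute_utf8 = "\u00e9".encode("utf-8")

# Strict ASCII rejects anything above 0x7F.
try:
    b"\xe9".decode("ascii")
    ascii_accepts_high_byte = True
except UnicodeDecodeError:
    ascii_accepts_high_byte = False
```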
>In my experience, allowing/requiring programmers to specify sharedness is
>a very rich source of hard-to-find bugs.
My experience is the opposite, since most objects aren't shared. :)
You could probably do something like add an "owning thread" to each object
structure, and on refcount throw an exception if not shared and the current
thread isn't the owner. Not sure if space is a concern, but since the object
is either shared or needs its own mutex, you make them a union:
(Not saying I have an answer to
the performance hit of locking on incref/decref, just saying that the
development cost of 'shared' is very high.)
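
The "owning thread" idea can be sketched at the Python level. This is only
an illustration of the check, not interpreter code: the class `Owned`, the
exception `OwnershipError`, and the `get()` accessor are hypothetical names,
and the check happens on access rather than on a reference-count change.

```python
import threading


class OwnershipError(RuntimeError):
    """Raised when an unshared object is touched from a foreign thread."""


class Owned:
    """Hypothetical wrapper: owned by one thread, or explicitly shared."""

    def __init__(self, value, shared=False):
        self._value = value
        # Only one of the two fields is ever needed: a shared object
        # carries a lock, an unshared one records its owning thread.
        self._lock = threading.Lock() if shared else None
        self._owner = None if shared else threading.get_ident()

    def get(self):
        if self._lock is not None:
            # Shared object: any thread may read it, under the lock.
            with self._lock:
                return self._value
        if threading.get_ident() != self._owner:
            raise OwnershipError("object is not shared and caller is not its owner")
        return self._value
```

An unshared object used from the wrong thread fails at once, which is the
kind of early error the proposal is after; the cost is that the programmer
has to decide about sharedness up front.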
Fredrik Lundh replied to himself in c.l.py:
>> as far as I can tell, it's supposed to be a feature.
>> if you mix 8-bit strings with unicode strings, python 1.6a2
>> attempts to interpret the 8-bit string as an utf-8 encoded
>> unicode string.
>> but yes, I also think it's a bug. but this far, my attempts
>> to get someone else to fix it has failed. might have to do
>> it myself... ;-)
>postscript: the powers-that-be has decided that this is not
>a bug. if you thought that strings were just sequences of
>characters, just as in Perl and Tcl, you're in for one big
>surprise in Python 1.6...
I just read the last few posts of the powers-that-be-list on this subject
(Thanks to Christian for pointing out the archives in c.l.py ;-), and I
must say I completely agree with Fredrik. The current situation sucks. A
string should always be a sequence of characters. A utf-8-encoded 8-bit
string in Python is *not* a string, but a "ByteArray". An 8-bit string
should never be assumed to be utf-8 because of that distinction. (The
default encoding for the builtin unicode() function may be another story.)
Python 1.6a2 is around 10% slower than 1.5 on pystone.
Any idea why?
[amk@mira Python-1.6a2]$ ./python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.59
This machine benchmarks at 2785.52 pystones/second
[amk@mira Python-1.6a2]$ python1.5 Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.19
This machine benchmarks at 3134.8 pystones/second
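
For reference, the arithmetic behind the two runs above, as a small sketch
that uses only the figures printed in the transcript. The variable names
are made up for illustration.

```python
passes = 10000
time_16a2 = 3.59  # seconds for Python 1.6a2
time_15 = 3.19    # seconds for Python 1.5

# pystones/second is simply passes divided by elapsed time.
rate_16a2 = passes / time_16a2
rate_15 = passes / time_15

# Two ways to express the slowdown.
extra_time = (time_16a2 - time_15) / time_15        # fraction more time taken
lost_throughput = (rate_15 - rate_16a2) / rate_15   # fraction fewer pystones/s
```

On these numbers 1.6a2 takes roughly 12.5% longer, or equivalently runs
about 11% fewer pystones per second, which matches "around 10%".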
Hey, don't blame me for posting a joke :-)
Please read from the beginning, don't look at the end first.
No, this is no offense...
Christian Tismer :^) <mailto:firstname.lastname@example.org>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com
[Moving this to python-dev because it's a musing
> > The main point is to avoid string.*.
> Agreed. Also replacing map by a loop might not even be slower.
> What remains as open question: Several modules need access
> to string constants, and they therefore still have to import
> Is there an elegant solution to this?
> That's why i asked for some way to access "".__class__ or
> whatever, to get into some common namespace with the constants.
I dunno. However, I've noticed that in many situations where map() could
be used with a string.* function (*if* you care about the speed-up and
you don't care about the readability issue), there's no equivalent
that uses the new string methods. This stems from the fact that map()
wants a function, not a method.
Python 3000 solves this partly, assuming types and classes are unified
Where in 1.5 we wrote
in Python 3K we will be able to write
However, this is *still* not as powerful as
map(lambda s: s.strip(), L)
because the former requires that all items in L are in fact strings,
while the latter works for anything with a strip() method (in
particular Unicode objects and UserString instances).
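
The difference can be seen in a short sketch for present-day Python 3,
where `str.strip` is available as an unbound method. This spelling is an
assumption about how the idea looks today, not necessarily the form
anticipated in this message; the sample lists are made up.

```python
from collections import UserString

L = ["  a ", "b  "]

# Unbound-method form: fine as long as every item is a real str.
stripped_method = list(map(str.strip, L))

# Lambda form: works for anything that has a strip() method.
mixed = ["  a ", UserString(" b ")]
stripped_lambda = list(map(lambda s: s.strip(), mixed))

# The unbound-method form rejects the UserString instance.
try:
    list(map(str.strip, mixed))
    method_form_accepts_userstring = True
except TypeError:
    method_form_accepts_userstring = False
```

Note that in Python 3 map() returns a lazy iterator, hence the list() calls.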
Maybe Python 3000 should recognize map(lambda) and generate more
efficient code for it...
--Guido van Rossum (home page: http://www.python.org/~guido/)