[Python-3000] string C API
paul at prescod.net
Fri Sep 15 20:36:21 CEST 2006
On 9/15/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Friday 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
> > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)
> The algorithm for choosing the one-byte encoding could be:
> - if the current locale uses a one-byte encoding, use that encoding
> - otherwise, if current locale language has a popular one-byte encoding
> (for many languages this would mean iso-8859-<X>), use that encoding
> - otherwise, no one-byte encoding
> This would ensure that, for example, Russian text on a system configured
> with a Russian locale does not always end up using two bytes per
> character internally.
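For concreteness, the selection rule quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the set of one-byte codecs, and the language-to-encoding table are all hypothetical and not part of any actual proposal or of CPython.

```python
import codecs

# Assumption: a small, hand-picked set of codecs treated as "one-byte".
# Names are canonicalised through codecs.lookup() so aliases compare equal.
ONE_BYTE_CODECS = {
    codecs.lookup(name).name
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7",
                 "iso-8859-15", "koi8-r", "cp1251", "cp1252")
}

# Assumption: an illustrative (not authoritative) map from a locale's
# language code to a commonly used one-byte encoding for that language.
POPULAR_ONE_BYTE = {
    "ru": "koi8-r",
    "fr": "iso-8859-15",
    "de": "iso-8859-15",
    "el": "iso-8859-7",
}


def choose_one_byte_encoding(locale_encoding, language):
    """Return a canonical one-byte codec name, or None if there is none."""
    # Step 1: the locale's own encoding, if it is a one-byte encoding.
    if locale_encoding:
        try:
            canonical = codecs.lookup(locale_encoding).name
        except LookupError:
            canonical = None
        if canonical in ONE_BYTE_CODECS:
            return canonical
    # Step 2: a popular one-byte encoding for the locale's language.
    fallback = POPULAR_ONE_BYTE.get(language)
    if fallback is not None:
        return codecs.lookup(fallback).name
    # Step 3: no one-byte encoding.
    return None
```

Under these assumptions, a Russian UTF-8 locale would fall through to step 2 and pick KOI8-R, while a Japanese locale would end up with no one-byte encoding at all.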
I do not believe that this extra complexity will be valuable in the
long term because most Europeans will switch to UTF-8 locales over the next
five years. The current situation makes no sense. Think about it from the
end-user's point of view:
"You can use KOI8-R/ISO-8859-? or UTF-8.
Pro for KOI8-R:
1. text files will use 0.8% instead of 1% of your hard disk space.
2. backwards compatibility
Pro for UTF-8:
1. Better compatibility with new software
2. Easier to share files across geographic boundaries
3. Ability to encode characters from other character sets
4. Access to characters like smart quotes, wingdings, fractions and so on."
The result seems obvious to me: fixed-width 8-bit encodings are a terrible idea
and need to just go away. Let's not build them into Python's core on the
basis of a minor and fleeting performance improvement.
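To put a rough number on the size difference being traded away, here is a minimal sketch that measures how many bytes a short Russian string takes in a few encodings. The sample string and the helper name are made up for illustration; real text with more ASCII (markup, digits, spaces) would narrow the gap.

```python
def encoded_sizes(text, encodings):
    """Map each encoding name to the number of bytes `text` occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Hypothetical sample: 9 Cyrillic letters plus a comma and a space.
sample = "Привет, мир"

sizes = encoded_sizes(sample, ("koi8-r", "utf-8", "utf-16-le", "utf-32-le"))
for name, size in sizes.items():
    print(f"{name:10} {size:3d} bytes")
```

For this sample, KOI8-R needs 11 bytes and UTF-8 needs 20, so the one-byte encoding saves somewhat less than half. That is the entire benefit on offer.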