Divorcing str and unicode (no more implicit conversions).
Hi.

Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when being super extra careful about the types of str or unicode objects that my variables can contain, there is always some case or oversight where something unexpected happens which results in a conversion that triggers a decode error. str.join() of a list of strs, where one unicode object appears unexpectedly, and voilà! exceptions galore. Sometimes the problem shows up late because your test data doesn't always contain accented characters. I'm sure many of you have experienced that, or some variant, at some point.

I came to realize recently that this problem shares a strong similarity with the problem of implicit type conversions in C++, or at least it feels the same: stuff just happens implicitly, and it's hard to track down where and when it happens just by looking at the code. Part of the problem is that the unicode object acts a lot like a str, which is convenient, but...

What if we could completely disable the implicit conversions between unicode and str? In other words, what if you were ALWAYS forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with this issue?

How hard would that be to implement? Would it break a lot of code? Would some people want that? (I know I would, at least for some of my code.)

It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome.

cheers,
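[A minimal sketch of the behaviour being proposed, written with the strict str/bytes split that a hypothetical future Python might adopt — modern Python 3 semantics, shown here purely as illustration, not as anything available in 2005:]

```python
# Under a strict split, text (str) and bytes never mix implicitly.
parts = ['hello', 'été']
joined = ', '.join(parts)            # text joined with text: always fine

mixing_failed = False
try:
    b', '.join([b'hello', 'world'])  # bytes joined with text: refused loudly
except TypeError:
    mixing_failed = True             # the conversion must be made explicit

# The explicit spelling Martin asks for: call .encode() yourself.
explicit = b', '.join([b'hello', 'world'.encode('utf-8')])
```

The failure is immediate and type-based, so it cannot hide until an accented character shows up in production data.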
Martin Blais <blais@furius.ca> writes:
What if we could completely disable the implicit conversions between unicode and str? In other words, what if you were ALWAYS forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with this issue?
I don't know. I've made one or two apps safe against this and it's mostly just annoying.
How hard would that be to implement?
import sys
reload(sys)
sys.setdefaultencoding('undefined')
Would it break a lot of code? Would some people want that? (I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome.
I'm not sure it's a sensible default. Cheers, mwh -- It is never worth a first class man's time to express a majority opinion. By definition, there are plenty of others to do that. -- G. H. Hardy
Michael Hudson wrote:
Martin Blais <blais@furius.ca> writes:
What if we could completely disable the implicit conversions between unicode and str? In other words, what if you were ALWAYS forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with this issue?
I don't know. I've made one or two apps safe against this and it's mostly just annoying.
How hard would that be to implement?
import sys
reload(sys)
sys.setdefaultencoding('undefined')
You shouldn't post tricks like these :-) The correct way to change the default encoding is by providing a sitecustomize.py module which then calls sys.setdefaultencoding("undefined"). Note that the codec "undefined" was added for just this reason.
Would it break a lot of code? Would some people want that? (I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome.
I'm not sure it's a sensible default.
Me neither, especially since this would make it impossible to write polymorphic code - e.g. ', '.join(list) wouldn't work anymore if the list contains Unicode; ditto for u', '.join(list) with the list containing a plain string. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 30 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
On 10/3/05, M.-A. Lemburg <mal@egenix.com> wrote:
I'm not sure it's a sensible default.
Me neither, especially since this would make it impossible to write polymorphic code - e.g. ', '.join(list) wouldn't work anymore if the list contains Unicode; ditto for u', '.join(list) with the list containing a plain string.
Sounds like what you want is exactly what I want to avoid (for those two types anyway). cheers,
M.-A. Lemburg wrote:
Michael Hudson wrote:
Martin Blais <blais@furius.ca> writes:
What if we could completely disable the implicit conversions between unicode and str? In other words, what if you were ALWAYS forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with this issue?
I don't know. I've made one or two apps safe against this and it's mostly just annoying.
How hard would that be to implement?
import sys
reload(sys)
sys.setdefaultencoding('undefined')
You shouldn't post tricks like these :-)
The correct way to change the default encoding is by providing a sitecustomize.py module which then calls sys.setdefaultencoding("undefined").
This is a much more evil trick IMO, as it affects all Python code, rather than a single program. I would argue that it's evil to change the default encoding in the first place, except in this case to disable implicit encoding or decoding. Jim -- Jim Fulton mailto:jim@zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org
Jim Fulton wrote:
I would argue that it's evil to change the default encoding in the first place, except in this case to disable implicit encoding or decoding.
absolutely. unfortunately, all attempts to add such information to the sys module documentation seem to have failed... (last time I tried, I seem to remember that someone argued that "it's there, so it should be documented in a neutral fashion") </F>
Martin Blais wrote:
On 10/3/05, Michael Hudson <mwh@python.net> wrote:
Martin Blais <blais@furius.ca> writes:
How hard would that be to implement?
import sys
reload(sys)
sys.setdefaultencoding('undefined')
Hmmm any particular reason for the call to reload() here?
Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys. Reinhold -- Mail address is perfectly valid!
On 10/15/05, Reinhold Birkenfeld <reinhold-birkenfeld-nospam@wolke7.net> wrote:
Martin Blais wrote:
On 10/3/05, Michael Hudson <mwh@python.net> wrote:
Martin Blais <blais@furius.ca> writes:
How hard would that be to implement?
import sys
reload(sys)
sys.setdefaultencoding('undefined')
Hmmm any particular reason for the call to reload() here?
Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys.
Thanks. cheers,
Martin Blais wrote:
Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys.
Thanks.
Actually, I should take the opportunity to advise people that setdefaultencoding doesn't really work. With the default default encoding, strings and Unicode objects hash equal when they are equal. If you change the default encoding, this property goes away (perhaps unless you change it to Latin-1). As a result, dictionaries where you mix string and Unicode keys won't work: you might not find a value for a string key when looking up with a Unicode object, and vice versa. Regards, Martin
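[For contrast: the strict-separation design the thread is circling around sidesteps this hashing trap entirely, by making text and byte strings never compare equal at all, so a mixed-key dictionary simply holds two distinct entries. Modern Python 3 semantics, shown here only as illustration:]

```python
# Under a strict str/bytes split, equal-looking keys of different
# types never compare equal, so they are distinct dictionary entries
# rather than a hash/equality trap.
d = {b'key': 1}
d['key'] = 2             # a second, distinct entry; nothing is overwritten
bytes_value = d[b'key']  # -> 1
text_value = d['key']    # -> 2
```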
Le lundi 03 octobre 2005 à 02:09 -0400, Martin Blais a écrit :
What if we could completely disable the implicit conversions between unicode and str?
This would be very annoying when dealing with some modules or libraries where the type (str / unicode) returned by a function depends on the context, build, or platform.

A good rule of thumb is to convert to unicode everything that is semantically textual, and to only use str for what is to be semantically treated as a string of bytes (network packets, identifiers...). This is also, AFAIU, the semantic model which is favoured for a hypothetical future version of Python.

This is what I'm using to do a safe conversion to a given type without worrying about the type of the argument:

DEFAULT_CHARSET = 'utf-8'

def safe_unicode(s, charset=None):
    """
    Forced conversion of a string to unicode; does nothing if the
    argument is already a unicode object.

    This function is useful because the .decode method on a unicode
    object, instead of being a no-op, tries to do a double conversion
    back and forth (which often fails because 'ascii' is the default
    codec).
    """
    if isinstance(s, str):
        return s.decode(charset or DEFAULT_CHARSET)
    else:
        return s

def safe_str(s, charset=None):
    """
    Forced conversion of a unicode object to str; does nothing if the
    argument is already a plain str object.

    This function is useful because the .encode method on a str object,
    instead of being a no-op, tries to do a double conversion back and
    forth (which often fails because 'ascii' is the default codec).
    """
    if isinstance(s, unicode):
        return s.encode(charset or DEFAULT_CHARSET)
    else:
        return s

Good luck

Antoine.
Antoine Pitrou wrote:
A good rule of thumb is to convert to unicode everything that is semantically textual
and isn't pure ASCII. (anyone who is tempted to argue otherwise should benchmark their applications, both speed- and memory-wise, and be prepared to come up with very strong arguments for why Python programs shouldn't be allowed to be fast and memory-efficient whenever they can...) </F>
Le lundi 03 octobre 2005 à 14:59 +0200, Fredrik Lundh a écrit :
Antoine Pitrou wrote:
A good rule of thumb is to convert to unicode everything that is semantically textual
and isn't pure ASCII.
How can you be sure that something that is /semantically textual/ will always remain "pure ASCII" ? That's contradictory, unless your software never goes out of the anglo-saxon world (and even...).
(anyone who are tempted to argue otherwise should benchmark their applications, both speed- and memorywise, and be prepared to come up with very strong arguments for why python programs shouldn't be allowed to be fast and memory-efficient whenever they can...)
I think most applications don't critically depend on text-processing performance. OTOH, international adaptability is the kind of thing that /will/ bite you one day if you don't prepare for it from the beginning. Also, if necessary, the distinction could be an implementation detail and the conversion transparent (like int vs. long): the text would be stored in an 8-bit charset as long as possible and converted to a wide encoding only when necessary. The important thing is that these optimisations, if they are necessary, should be handled transparently by the Python runtime. (It seems to me - I may be mistaken - that modern Windows versions treat every string as 16-bit Unicode internally. Why are they doing it if it is so inefficient?) Regards Antoine.
Antoine Pitrou wrote:
A good rule of thumb is to convert to unicode everything that is semantically textual
and isn't pure ASCII.
How can you be sure that something that is /semantically textual/ will always remain "pure ASCII" ?
"is" != "will always remain" </F>
Antoine Pitrou <solipsis@pitrou.net> wrote:
Le lundi 03 octobre 2005 à 14:59 +0200, Fredrik Lundh a écrit :
Antoine Pitrou wrote:
A good rule of thumb is to convert to unicode everything that is semantically textual
and isn't pure ASCII.
How can you be sure that something that is /semantically textual/ will always remain "pure ASCII" ? That's contradictory, unless your software never goes out of the anglo-saxon world (and even...).
Non-unicode text input widgets. Works great. Can be had with the ANSI wxPython installation.
(it seems to me - I may be mistaken - that modern Windows versions treat every string as 16-bit unicode internally. Why are they doing it if it is that inefficient?)
Because modern Windows supports all sorts of symbols which are necessary for certain special English uses (Greek symbols for math, etc.), and trying to have all of them without just using the Unicode backend that is used for all of the international "builds" (isn't it just a language definition?) anyway would be a waste of time/effort. - Josiah
Josiah Carlson wrote:
and isn't pure ASCII.
How can you be sure that something that is /semantically textual/ will always remain "pure ASCII" ? That's contradictory, unless your software never goes out of the anglo-saxon world (and even...).
Non-unicode text input widgets. Works great. Can be had with the ANSI wxPython installation.
You're both missing that Python is dynamically typed. A single string source doesn't have to return the same type of strings, as long as the objects it returns are compatible with Python's string model and with each other. Under the default encoding (and quite a few other encodings), that's true for plain ascii strings and Unicode strings. This is a good thing. </F>
Hi, Josiah:
How can you be sure that something that is /semantically textual/ will always remain "pure ASCII" ? That's contradictory, unless your software never goes out of the anglo-saxon world (and even...).
Non-unicode text input widgets.
You didn't understand my statement. I didn't mean : - how can you /technically enforce/ no unicode text at all but : - how can you be sure that your users will never /want/ to enter some text that can't be represented with the current 8-bit charset? Of course the answer to the latter is: you can't. Fredrik:
Under the default encoding (and quite a few other encodings), that's true for plain ascii strings and Unicode strings.
If I have a unicode string containing legal characters greater than 0x7F, and I pass it to a function which converts it to str, the conversion fails. If I have an 8-bit string containing legal non-ascii characters (for example the name of a file as returned by the filesystem, which of course I have no prior control over), and I give it to a function which does an implicit conversion to unicode, the conversion fails. Here is an example so that you really understand. I am under a French locale (iso-8859-15); let's just try to enter a French word and see what happens when converting to unicode: -> As a string constant:
>>> s = "été"
>>> s
'\xe9t\xe9'
>>> u = unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
-> By asking for input:
>>> s = raw_input()
été
>>> s
'\xe9t\xe9'
>>> unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
It should work, but it fails miserably. In the current situation, if the programmer doesn't carefully plan for these cases by manually managing conversions (which of course he can do - but it's boring and bothersome - not to mention that many programmers do not even understand the issue!), some users will see the program die with a nasty exception, just because they happen to need a bit more than the plain Latin alphabet without diacritics. (Even the standard Python library is bitten: witness the weird getcwd() / getcwdu() pair...) I find it surprising that you claim there is no difficulty when everything points to the contrary. See for example how often confused developers ask for help on mailing lists... Regards Antoine.
Antoine Pitrou wrote:
Under the default encoding (and quite a few other encodings), that's true for plain ascii strings and Unicode strings.
If I have a unicode string containing legal characters greater than 0x7F, and I pass it to a function which converts it to str, the conversion fails.
so? if it does that, it's not unicode safe. what does that have to do with my argument (which is that you can safely mix ascii strings and unicode strings, because that's how things were designed).
Here is an example so that you really understand.
I wrote the unicode type. I do understand how it works. </F>
Hi, Le lundi 03 octobre 2005 à 20:37 +0200, Fredrik Lundh a écrit :
If I have a unicode string containing legal characters greater than 0x7F, and I pass it to a function which converts it to str, the conversion fails.
so? if it does that, it's not unicode safe. [...] what's that has to do with my argument (which is that you can safely mix ascii strings and unicode strings, because that's how things were designed).
If that's how things were designed, then Python's entire standard library (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters. There lies the problem for many people, until the stdlib is fixed - or until the string types are changed. That's why you very regularly see people complaining about how conversions sometimes break their code in various ways. Anyway, I don't think we will reach an agreement here. We have different expectations w.r.t. how the programming language may/should handle general text. I propose we end the discussion. Regards Antoine.
Antoine Pitrou wrote:
If I have a unicode string containing legal characters greater than 0x7F, and I pass it to a function which converts it to str, the conversion fails.
so? if it does that, it's not unicode safe. [...] what's that has to do with my argument (which is that you can safely mix ascii strings and unicode strings, because that's how things were designed).
If that's how things were designed, then Python's entire standard library (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters.
huh? first you talk about functions that convert unicode strings to 8-bit strings, now you talk about functions that return raw 8-bit strings? and all this in response to a post that argues that it's in fact a good idea to use plain strings to hold textual data that happens to contain ASCII only, because 1) it works, by design, and 2) it's almost always more efficient. if you don't know what your own argument is, you cannot expect anyone to understand it. </F>
If that's how things were designed, then Python's entire standard library (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters.
huh? first you talk about functions that convert unicode strings to 8-bit strings, now you talk about functions that return raw 8-bit strings?
Are you deliberately missing the argument? And can't you understand that conversions are problematic in both directions (str -> unicode /and/ unicode -> str)? If an stdlib function returns an 8-bit string containing non-ascii data, then this string used in unicode context incurs an implicit conversion, which fails. How's that for "unicode safety" of stdlib functions? Will you argue that this gives no difficulties to anyone ?
all this in response to a post that argues that it's in fact a good idea to use plain strings to hold textual data that happens to contain ASCII only,
To which you apparently didn't read my answer, which is: you can never be sure that a variable containing something which is /semantically/ textual (*) will never contain anything other than ASCII text. For example raw_input() won't tell you that its 8-bit string result contains some chars > 0x7F. Same for many other library functions. How do you cope with (more or less occasional) non-ascii data coming in as 8-bit strings? (*) that is, contains some natural language. Either you carefully plan for non-ascii text coming into your application (including workarounds for Python's ascii-by-default conversion policy), or you deliberately cripple your application by deciding that non-ASCII text is forbidden in (some or all) places. Choose the latter and you'll be hostile to users. And this thread began with a poster who found the way implicit conversions happen in Python difficult. So it's very funny that you deny the existence of a problem for certain developers. Antoine.
At 10:38 PM 10/3/2005 +0200, Antoine Pitrou wrote:
To which you apparently didn't read my answer, that is: you can never be sure that a variable containing something which is /semantically/ textual (*) will never contain anything other than ASCII text. For example raw_input() won't tell you that its 8-bit string result contains some chars > 0x7F. Same for many other library functions. How do you cope with (more or less occasional) non-ascii data coming in as 8-bit strings?
Presumably in Python 3.0, opening a file in "text" mode will require an encoding to be specified, and opening it in "binary" mode will cause it to produce or consume byte arrays, not strings. This should apply to sockets too, and really any I/O facility, including GUI frameworks, DBAPI objects, os.listdir(), etc. Of course, to get there we really need to add a convenient bytes type, perhaps by enhancing the current 'array' module. It'd be nice to have a way to get this in 2.x versions so people can start fixing stuff to work the right way. With no 8-bit strings coming in, there should be no unicode/str problems except those you create yourself.
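[The split Phillip describes can be sketched with file I/O: text mode takes an explicit encoding and yields strings, binary mode yields raw bytes. This is the modern Python 3 spelling, used here purely as illustration of the proposed design:]

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Text mode: an explicit encoding is part of opening the file;
# str goes in, str comes out.
with open(path, 'w', encoding='iso-8859-1') as f:
    f.write('été')

# Binary mode: no encoding involved; raw bytes come out.
with open(path, 'rb') as f:
    raw = f.read()

# Reading back in text mode decodes at the boundary, once.
with open(path, 'r', encoding='iso-8859-1') as f:
    text = f.read()
```

With no undecoded 8-bit strings crossing the I/O boundary, the mixing problem cannot arise downstream.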
Presumably in Python 3.0, opening a file in "text" mode will require an encoding to be specified, and opening it in "binary" mode will cause it to produce or consume byte arrays, not strings. This should apply to sockets too, and really any I/O facility, including GUI frameworks, DBAPI objects, os.listdir(), etc.
Great :)
Of course, to get there we really need to add a convenient bytes type, perhaps by enhancing the current 'array' module. It'd be nice to have a way to get this in 2.x versions so people can start fixing stuff to work the right way.
Could the "bytes" type be just the same as the current "str" type but without the implicit unicode conversion ? Or am I missing some desired functionality ?
On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
Could the "bytes" type be just the same as the current "str" type but without the implicit unicode conversion ? Or am I missing some desired functionality ?
No. It will be a mutable array of bytes. It will intentionally resemble strings as little as possible. There won't be a literal for it. But you will be able to convert between bytes and strings quite easily by specifying an encoding. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
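[For the record, a sketch of how this conversion story eventually landed: Python 3 grew an immutable bytes type plus a mutable bytearray, with all conversions going through an explicit encoding, as Guido describes. Modern syntax, shown only as illustration:]

```python
s = 'été'
b = s.encode('utf-8')         # explicit conversion; an encoding is required
round_tripped = b.decode('utf-8')

ba = bytearray(b)             # the mutable array-of-bytes variant
ba[0:0] = b'x'                # in-place mutation, impossible on str
```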
Le lundi 03 octobre 2005 à 14:02 -0700, Guido van Rossum a écrit :
On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
Could the "bytes" type be just the same as the current "str" type but without the implicit unicode conversion ? Or am I missing some desired functionality ?
No. It will be a mutable array of bytes. It will intentionally resemble strings as little as possible. There won't be a literal for it.
Thinking about it, it may have to offer the search and replace facilities offered by strings (including regular expressions). Here is a use case: say I'm reading an HTML file (or receiving it over the network). Since the character encoding can be specified in the HTML file itself (in the <head>...</head>), I must first receive it as a bytes object. But then I must fetch the encoding information from the HTML header: therefore I must use some string ops on the bytes object to parse this information. Only after I have discovered the encoding can I finally convert the bytes object to a text string. Or would there be another way to do it?
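[The use case above can be handled with byte-level searching alone; a sketch in modern Python, where regular expressions work directly on bytes. The helper name and the regex are hypothetical and cover only the simplest forms of the charset declaration:]

```python
import re

def sniff_charset(raw, default='utf-8'):
    # Hypothetical helper: search the raw bytes for a charset
    # declaration; a bytes pattern keeps everything in byte space.
    m = re.search(rb'charset=["\']?([A-Za-z0-9_\-]+)', raw)
    return m.group(1).decode('ascii') if m else default

raw = b'<html><head><meta charset="iso-8859-1"></head>\xe9</html>'
encoding = sniff_charset(raw)
text = raw.decode(encoding)    # decode only after sniffing the header
```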
This would presumably support the (read-only part of the) buffer API so search would be covered. I don't see a use case for replace. Alternatively, you could always specify Latin-1 as the encoding and convert it that way -- I don't think there's any input that can cause Latin-1 decoding to fail. On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le lundi 03 octobre 2005 à 14:02 -0700, Guido van Rossum a écrit :
On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
Could the "bytes" type be just the same as the current "str" type but without the implicit unicode conversion ? Or am I missing some desired functionality ?
No. It will be a mutable array of bytes. It will intentionally resemble strings as little as possible. There won't be a literal for it.
Thinking about it, it may have to offer the search and replace facilities offered by strings (including regular expressions).
Here is a use case: say I'm reading an HTML file (or receiving it over the network). Since the character encoding can be specified in the HTML file itself (in the <head>...</head>), I must first receive it as a bytes object. But then I must fetch the encoding information from the HTML header: therefore I must use some string ops on the bytes object to parse this information. Only after I have discovered the encoding can I finally convert the bytes object to a text string.
Or would there be another way to do it?
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Le lundi 03 octobre 2005 à 17:42 -0700, Guido van Rossum a écrit :
I don't see a use case for replace.
Agreed.
Alternatively, you could always specify Latin-1 as the encoding and convert it that way -- I don't think there's any input that can cause Latin-1 decoding to fail.
You seem to be right. « In 1992, the IANA registered the character map ISO-8859-1 (note the extra hyphen), a superset of ISO/IEC 8859-1, for use on the Internet. This map assigns control characters to the code values 00-1F, 7F, and 80-9F. It thus provides for 256 characters via every possible 8-bit value. » http://en.wikipedia.org/wiki/ISO_8859-1#ISO-8859-1 Regards Antoine.
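[That property is easy to verify in modern Python: every possible byte value decodes under Latin-1, and the decode round-trips losslessly:]

```python
data = bytes(range(256))         # every possible 8-bit value
text = data.decode('latin-1')    # never raises: all 256 values are mapped
assert text.encode('latin-1') == data   # and the mapping is lossless
```

Because each byte maps to exactly the code point with the same numeric value, byte 0xE9 comes back as 'é', and re-encoding restores the original bytes exactly.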
Antoine Pitrou wrote:
To which you apparently didn't read my answer, that is: you can never be sure that a variable containing something which is /semantically/ textual (*) will never contain anything other than ASCII text.
That is simply not true. There are variables that are semantically textual, yet where I can be sure that the value is a byte string only if it consists purely of ASCII. For example, if you invoke a Tkinter function, it will return a byte string if the result is purely ASCII, and a Unicode string otherwise. This is an interface guarantee, hence I can be sure. Regards, Martin
On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
If that's how things were designed, then Python's entire standard library (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters.
huh? first you talk about functions that convert unicode strings to 8-bit strings, now you talk about functions that return raw 8-bit strings?
Are you deliberately missing the argument? And can't you understand that conversions are problematic in both directions (str -> unicode /and/ unicode -> str)?
Both directions are a problem. Just a note: it's not so much the conversions that I find problematic, but rather the implicit nature of the conversions (combined with the fact that they may fail). In addition to being difficult to track down, these implicit conversions may be costing processing time as well. cheers,
Martin Blais wrote:
On 10/3/05, Antoine Pitrou <solipsis@pitrou.net> wrote:
If that's how things were designed, then Python's entire standard library (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters.
huh? first you talk about functions that convert unicode strings to 8-bit strings, now you talk about functions that return raw 8-bit strings?
Are you deliberately missing the argument? And can't you understand that conversions are problematic in both directions (str -> unicode /and/ unicode -> str)?
Both directions are a problem.
Just a note: it's not so much the conversions that I find problematic, but rather the implicit nature of the conversions (combined with the fact that they may fail). In addition to being difficult to track down, these implicit conversions may be costing processing time as well.
We've already pointed you to a solution which you might want to use. Why don't you just try it ? BTW, if you want to read up on all the reasons why Unicode was done the way it was, have a look at: http://www.python.org/peps/pep-0100.html and read up in the python-dev archives: http://mail.python.org/pipermail/python-dev/2000-March/thread.html and the next months after the initial checkin.
From what I've read on the web about the Python Unicode implementation, we have one of the better ones compared to other languages' implementations and their choices and design decisions.
None of them is perfect, but that's seems to be an inherent problem with Unicode no matter how you try to approach it - even more so, if you are trying to add it to a language that has used ordinary C strings for text from day 1. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 30 2005)
"M" == "M.-A. Lemburg" <mal@egenix.com> writes:
M> From what I've read on the web about the Python Unicode M> implementation we have one of the better ones compared to other M> languages implementations and their choices and design M> decisions. Yes, indeed! Speaking-as-a-card-carrying-member-of-the-loyal-opposition-ly y'rs, -- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.
Antoine> If an stdlib function returns an 8-bit string containing Antoine> non-ascii data, then this string used in unicode context incurs Antoine> an implicit conversion, which fails. Such strings should be converted to Unicode at the point where they enter the application. That's likely the only place where you have a good chance of knowing the data encoding. Files generally have no encoding information associated with them. Some databases don't handle Unicode transparently. If you hang onto the input from such devices as plain strings until you need them as Unicode, you will almost certainly not know how the string was encoded. The state of the outside Unicode world being as miserable as it is (think web input forms), you often don't know the encoding at the interface and have to guess anyway. Even so, isolating that guesswork to the interface is better than recovering somewhere further downstream. Skip
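[Skip's "convert at the point of entry" rule can be sketched like this, in modern Python terms. The helper name is made up; the point is that the encoding is known (or guessed) exactly once, at the boundary, and nothing downstream ever sees an undecoded byte string:]

```python
import os
import tempfile

def read_names(path, encoding):
    # Hypothetical boundary function: bytes come in from the file,
    # decoded text goes out; the guesswork stays isolated here.
    with open(path, 'rb') as f:
        return [line.decode(encoding).rstrip('\n') for line in f]

# Simulate an externally-produced Latin-1 file.
path = os.path.join(tempfile.mkdtemp(), 'names.txt')
with open(path, 'wb') as f:
    f.write('été\nnoël\n'.encode('iso-8859-1'))

names = read_names(path, 'iso-8859-1')
```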
On Oct 3, 2005, at 3:47 PM, Fredrik Lundh wrote:
Antoine Pitrou wrote:
If I have an unicode string containing legal characters greater than 0x7F, and I pass it to a function which converts it to str, the conversion fails.
so? if it does that, it's not unicode safe.
[...]
what's that has to do with my argument (which is that you can safely mix ascii strings and unicode strings, because that's how things were designed).
If that's how things were designed, then Python's entire standard brary (not to mention third-party libraries) is not "unicode safe" - to quote your own words - since many functions may return 8-bit strings containing non-ascii characters.
huh? first you talk about functions that convert unicode strings to 8-bit strings, now you talk about functions that return raw 8-bit strings? and all this in response to a post that argues that it's in fact a good idea to use plain strings to hold textual data that happens to contain ASCII only, because 1) it works, by design, and 2) it's almost always more efficient.
if you don't know what your own argument is, you cannot expect anyone to understand it.
Your point would be much easier to stomach if the "str" type could *only* hold 7-bit ASCII. Perhaps that can be done when Python gets an actual bytes type in 3.0. There indeed are a multitude of uses for the efficient storage/processing of ASCII-only data. However, currently, there are problems because it's so easy to screw yourself without noticing when mixing unicode and str objects. If, on the other hand, you have a 7bit ascii string type, and a 16/32-bit unicode string type, both can be used interchangeably and there is no possibility for any en/de-coding issues. And asciiOnlyStringType.encode('utf-8') can become _ultra_ efficient, as a bonus. :) Seems win-win to me. James
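James's efficiency point is easy to check: for ASCII-only text, UTF-8 encoding is the identity on the underlying bytes. A quick illustration (shown with Python 3 semantics, where str and bytes are distinct types as the proposal envisions):

```python
# For ASCII-only text, the UTF-8 and ASCII encodings produce identical
# bytes, which is why an ASCII-restricted string type could make
# .encode("utf-8") essentially a no-op copy of its buffer.
s = "plain ascii text"
assert s.encode("utf-8") == s.encode("ascii") == b"plain ascii text"
```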
James Y Knight wrote:
Your point would be much easier to stomach if the "str" type could *only* hold 7-bit ASCII.
why? strings are not mutable, so it's not like an ASCII string will suddenly sprout non-ASCII characters. what ends up in a string is defined by the string source. if you cannot trust the source, your programs will never work. after all, there's nothing in Python that keeps things like:

    s = file.readline().decode("iso-8859-1")
    s = elem.findtext("node")
    s = device.read_encoded_data()

from returning integers instead of strings, or returning socket objects on odd fridays. but if the interface spec says that they always return strings that adhere to python's text model (=unicode or things that can be mixed with unicode), you can trust them as much as you can trust anything else in Python.

(this is of course also why we talk about file-like objects in Python, and sequences, and iterators and iterables, and stuff like that. it's not type(obj) that's important, it's what you can do with obj and how it behaves when you do it)

</F>
Martin Blais wrote:
Hi.
Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when being super extra careful with the types of str's or unicode objects that my variables can contain, there is always some case or oversight where something unexpected happens which results in a conversion which triggers a decode error. str.join() of a list of strs, where one unicode object appears unexpectedly, and voila! exception galore. Sometimes the problem shows up late because your test code doesn't always contain accented characters. I'm sure many of you experienced that or some variant at some point.
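The failure mode described above is easy to reproduce in spirit. The sketch below is illustrative only: it uses Python 3 semantics, where the strict bytes/str split makes the mixed join fail immediately with a TypeError, whereas the Python 2 behaviour being complained about was an implicit ASCII decode that raised a late UnicodeDecodeError:

```python
# Illustrative sketch of the mixed-join problem: a list of text
# strings with one byte string that "appears unexpectedly".
parts = ["caf\u00e9", ", ", "menu"]   # unicode text
parts.append(b"\xc3\xa9clair")         # a byte string sneaks in

try:
    "".join(parts)
except TypeError as exc:
    # Python 2 would instead have attempted an implicit ASCII decode
    # here, succeeding on ASCII-only test data and blowing up later
    # on accented characters -- exactly the late failure described.
    print("mixed join rejected:", exc)
```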
I came to realize recently that this problem shares strong similarity with the problem of implicit type conversions in C++, or at least it feels the same: Stuff just happens implicitly, and it's hard to track down where and when it happens by just looking at the code. Part of the problem is that the unicode object acts a lot like a str, which is convenient, but...
I agree. I think it was a mistake to implicitly convert mixed string expressions to unicode.
What if we could completely disable the implicit conversions between unicode and str? In other words, if you would ALWAYS be forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with that issue?
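For the record, the always-explicit model being asked about here is essentially what Python 3 later adopted: str and bytes never convert implicitly, so every crossing between the two is a visible .encode() or .decode() call. A minimal sketch of that behaviour:

```python
# Under Python 3 there is no implicit str/bytes conversion: mixing the
# two types is an error, and each conversion must be spelled out.
text = "na\u00efve"
data = text.encode("utf-8")           # explicit text -> bytes
assert data.decode("utf-8") == text   # explicit bytes -> text

try:
    text + data                       # implicit conversion: refused
except TypeError:
    print("explicit encode/decode required")
```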
Perhaps.
How hard would that be to implement?
Not hard. We considered doing it for Zope 3, but ...
Would it break a lot of code?
Yes.
Would some people want that?
No, I wouldn't want lots of code to break. ;)
(I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome.
I think it's too late to change this. I wish it had been done differently. (OTOH, I'm very happy we have Unicode support, so I'm not really complaining. :)

I'll note that this hasn't been that much of a problem for us in Zope. We follow the strategy:

Antoine Pitrou wrote: ...
A good rule of thumb is to convert to unicode everything that is semantically textual, and to only use str for what is to be semantically treated as a string of bytes (network packets, identifiers...). This is also, AFAIU, the semantic model which is favoured for a hypothetical future version of Python.
This approach has worked pretty well for us. Still, when there is a problem, it's a real pain to debug because the error occurs too late, as you point out.

Jim

--
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org
participants (14)
- "Martin v. Löwis"
- Antoine Pitrou
- Fredrik Lundh
- Guido van Rossum
- James Y Knight
- Jim Fulton
- Josiah Carlson
- M.-A. Lemburg
- Martin Blais
- Michael Hudson
- Phillip J. Eby
- Reinhold Birkenfeld
- skip@pobox.com
- Stephen J. Turnbull