Unicode proposal: %-formatting ?

I wonder how we could add %-formatting to Unicode strings without duplicating the PyString_Format() logic.

First, do we need Unicode object %-formatting at all ?

Second, here is an emulation using strings and <default encoding> that should give an idea of how one could work with the different encodings:

    s = '%s %i abcäöü' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string via Unicode
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

Note that .encode() defaults to the current setting of <default encoding>.

Provided u maps to Latin-1, an alternative would be:

    u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 50 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
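For readers who want to try the round-trip idea today, here is a rough sketch in Python 3 terms, where str plays the role of the Unicode object and bytes the role of the 8-bit string. The helper name format_via_bytes and the sample value of u are assumptions made for illustration; they are not part of the proposal, and the sketch pins the encoding to Latin-1 instead of relying on a <default encoding>.

```python
def format_via_bytes(fmt, args, encoding="latin-1"):
    """Emulate text %-formatting by formatting at the byte level.

    Hypothetical helper: encode the format string and every text
    argument, let the bytes % operator do the work, then decode.
    """
    encoded_fmt = fmt.encode(encoding)
    encoded_args = tuple(
        a.encode(encoding) if isinstance(a, str) else a
        for a in args
    )
    return (encoded_fmt % encoded_args).decode(encoding)


u = "xyz"  # stand-in for the Unicode object u in the proposal
u1 = format_via_bytes("%s %i abcäöü", (u, 3))
```

This only works while every character in the format string and in the arguments maps to the chosen encoding, which is the same proviso the alternative above carries ("Provided u maps to Latin-1").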

[MAL]
Sure -- in the end, all the world speaks Unicode natively and encodings become historical baggage. Granted I won't live that long, but I may last long enough to see encodings become almost purely an I/O hassle, with all computation done in Unicode.
What's u? A Unicode object? Another Latin-1 string? A default-encoded string? How does the following know the difference?
I don't expect this actually works: for example, change %s to %4s. Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to know that some (or all) characters in u consume multiple bytes, so it can't extract "the right" number of bytes from u. I think % formatting has to know the truth of what you're doing.
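The %4s objection can be reproduced with a small sketch. It assumes Python 3.5 or later, where bytes supports %-formatting and counts widths and precisions in bytes, while str counts them in characters. The variable names are made up for the illustration.

```python
ch = "\u00e4"                 # 'ä', a single character
as_utf8 = ch.encode("utf-8")  # two bytes in UTF-8: b'\xc3\xa4'

# Byte-level formatting pads to 4 *bytes*: two spaces plus two bytes.
padded_bytes = b"%4s" % as_utf8

# Character-level formatting pads to 4 *characters*: three spaces plus 'ä'.
padded_text = "%4s" % ch

# A precision counted in bytes can cut a character in half.
truncated = b"%.1s" % as_utf8

print(len(padded_bytes.decode("utf-8")), len(padded_text))
```

After decoding, the byte-level result is one column short, and the truncated value is no longer valid UTF-8, which is why the formatter has to work on characters.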
More interesting is fmt % tuple where everything is Unicode; people can muck with Latin-1 directly today using regular strings, so the example above mostly shows artificial convolution.

Tim Peters wrote:
u refers to a Unicode object in the proposal. Sorry, forgot to mention that.
Hmm, guess you're right... format parameters should indeed refer to characters rather than number of encoding bytes. This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects.
... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

    u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".

--
Marc-Andre Lemburg

[MAL]
It's a bitch, isn't it <0.5 wink>? I hope they're paying you a lot for this!
... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ?
Anything other than taking the Unicode characters as-is would be incomprehensible. I mean, it's a Unicode format string sucking up Unicode strings -- what else could possibly make *sense*?
E.g. what would you get in these cases:
u = u"%s %s" % (u"abc", "abc")
That u"abc" gets substituted as-is seems screamingly necessary to me. I'm more baffled about what "abc" should do. I didn't understand the t#/s# etc arguments, and how those do or don't relate to what str() does. On the face of it, the idea that a gazillion and one distinct encodings all get lumped into "a string object" without remembering their nature makes about as much sense as if Python were to treat all instances of all user-defined classes as being of a single InstanceType type <wink> -- except in the latter case you at least get a __class__ attribute to find your way home again. As an ignorant user, I would hope that u"%s" % string had enough sense to know what string's encoding is all on its own, and promote it correctly to Unicode by magic.
Perhaps we need a new marker for "insert Unicode object here".
%s means string, and at this level a Unicode object *is* "a string". If this isn't obvious, it's likely because we're too clever about what non-Unicode string objects do in this context.
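One way to picture "%s takes Unicode as-is and promotes plain strings" is the sketch below. It is written for Python 3, with bytes standing in for the 8-bit string, and it assumes an explicit encoding argument because a byte string carries no record of its own encoding. The names promote and unicode_format are hypothetical and do not describe PyUnicode_Format().

```python
def promote(arg, encoding="ascii"):
    """Decode byte strings to text; pass every other object through."""
    if isinstance(arg, bytes):
        return arg.decode(encoding)
    return arg


def unicode_format(fmt, args, encoding="ascii"):
    """Format a text format string after promoting byte-string arguments."""
    return fmt % tuple(promote(a, encoding) for a in args)


u = unicode_format("%s %s", ("abc", b"abc"))
```

With the ASCII assumption, a byte string containing non-ASCII bytes raises UnicodeDecodeError instead of being guessed at, so the "by magic" promotion only succeeds when the caller states the encoding.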

participants (3): Guido van Rossum, M.-A. Lemburg, Tim Peters