Mailman 3 Unicode proposal: %-formatting ? - Python-Dev

Unicode proposal: %-formatting ?

M.-A. Lemburg

Nov. 11, 1999

3:59 p.m.

I wonder how we could add %-formatting to Unicode strings without duplicating the PyString_Format() logic.

First, do we need Unicode object %-formatting at all ?

Second, here is an emulation using strings and <default encoding> that should give an idea of how one could work with the different encodings:

    s = '%s %i abcäöü' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string via Unicode
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

Note that .encode() defaults to the current setting of <default encoding>.

Provided u maps to Latin-1, an alternative would be:

    u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 50 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
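For readers on a current interpreter, the emulation above can be sketched in Python 3, where `str` is already Unicode and `bytes` plays the role of the encoded string. This is only an illustration under assumptions that are not in the original message: `DEFAULT_ENCODING` is a stand-in for <default encoding> (UTF-8 is chosen here), and the sample value of `u` is made up.

```python
# Sketch only: Python 3 rendering of the emulation (bytes %-formatting needs 3.5+).
DEFAULT_ENCODING = "utf-8"  # assumption: stand-in for <default encoding>

u = "Grüße"  # assumption: sample Unicode value that also maps to Latin-1
s = "%s %i abcäöü".encode("latin-1")  # a Latin-1 encoded format string

# Convert Latin-1 s to a default-encoded byte string via Unicode
s1 = s.decode("latin-1").encode(DEFAULT_ENCODING)

# '%s' now inserts u in the default encoding
s2 = s1 % (u.encode(DEFAULT_ENCODING), 3)

# Finally, convert the default-encoded result back to Unicode
u1 = s2.decode(DEFAULT_ENCODING)

# Alternative when u maps to Latin-1: format directly in Latin-1
u2 = (s % (u.encode("latin-1"), 3)).decode("latin-1")

print(u1)
print(u2)
```

Both routes should yield the same Unicode string as long as every character of `u` exists in Latin-1; the first route also works for values of `u` outside Latin-1, provided the default encoding can represent them.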

Tim Peters

November 1999

5:38 a.m.

[MAL]

...

Sure -- in the end, all the world speaks Unicode natively and encodings become historical baggage. Granted I won't live that long, but I may last long enough to see encodings become almost purely an I/O hassle, with all computation done in Unicode.

...

What's u? A Unicode object? Another Latin-1 string? A default-encoded string? How does the following know the difference?

...

I don't expect this actually works: for example, change %s to %4s. Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to know that some (or all) characters in u consume multiple bytes, so it can't extract "the right" number of bytes from u. I think % formatting has to know the truth of what you're doing.
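The %4s objection can be reproduced on a current Python 3 interpreter, where `bytes` formatting counts bytes and `str` formatting counts characters. A minimal sketch, assuming UTF-8 as the byte encoding and a made-up two-character sample value:

```python
# Sketch only: field width counts bytes for encoded data,
# but characters for Unicode text.
u = "äö"                     # assumption: sample value, 2 characters
encoded = u.encode("utf-8")  # 4 bytes: each character takes 2 bytes in UTF-8

# Width 4 is already met by the 4 encoded bytes, so no padding is added.
as_bytes = (b"%4s|" % encoded).decode("utf-8")

# Width 4 is counted in characters, so 2 spaces of padding are added.
as_text = "%4s|" % u

print(repr(as_bytes))
print(repr(as_text))
```

The two results differ only in padding, which is the mismatch described above: a formatter that sees encoded bytes cannot align text by character count.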

...

More interesting is fmt % tuple where everything is Unicode; people can muck with Latin-1 directly today using regular strings, so the example above mostly shows artificial convolution.

M.-A. Lemburg

10:40 a.m.

Tim Peters wrote:

...

u refers to a Unicode object in the proposal. Sorry, forgot to mention that.

...

Hmm, guess you're right... format parameters should indeed refer to characters rather than number of encoding bytes. This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects.

...

... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ? E.g. what would you get in these cases:

    u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Guido van Rossum

1:28 p.m.

...

Tim Peters

7:33 a.m.

[MAL]

...

... This means a new PyUnicode_Format() implementation mapping Unicode format objects to Unicode objects.

It's a bitch, isn't it <0.5 wink>? I hope they're paying you a lot for this!

...

... hmm, there is a problem there: how should the PyUnicode_Format() API deal with '%s' when it sees a Unicode object as argument ?

Anything other than taking the Unicode characters as-is would be incomprehensible. I mean, it's a Unicode format string sucking up Unicode strings -- what else could possibly make *sense*?

...

E.g. what would you get in these cases:

u = u"%s %s" % (u"abc", "abc")

That u"abc" gets substituted as-is seems screamingly necessary to me. I'm more baffled about what "abc" should do. I didn't understand the t#/s# etc arguments, and how those do or don't relate to what str() does.

On the face of it, the idea that a gazillion and one distinct encodings all get lumped into "a string object" without remembering their nature makes about as much sense as if Python were to treat all instances of all user-defined classes as being of a single InstanceType type <wink> -- except in the latter case you at least get a __class__ attribute to find your way home again.

As an ignorant user, I would hope that u"%s" % string had enough sense to know what string's encoding is all on its own, and promote it correctly to Unicode by magic.

...

Perhaps we need a new marker for "insert Unicode object here".

%s means string, and at this level a Unicode object *is* "a string". If this isn't obvious, it's likely because we're too clever about what non-Unicode string objects do in this context.
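Python 3's `str` formatting does not guess an encoding for byte strings, so the promotion hoped for above has to be spelled out by the caller. A minimal sketch of that idea, assuming a hypothetical helper named `format_unicode` (not part of any proposal in this thread) and that the caller knows the encoding of the byte-string arguments:

```python
def format_unicode(fmt, args, encoding):
    """Hypothetical helper: decode byte-string arguments, then apply %-formatting.

    Unicode (str) arguments are passed through as-is; bytes arguments are
    decoded with the encoding the caller supplies.
    """
    promoted = tuple(
        a.decode(encoding) if isinstance(a, bytes) else a
        for a in args
    )
    return fmt % promoted


# Mirrors the u"%s %s" % (u"abc", "abc") case discussed above.
result = format_unicode("%s %s", ("abc", b"abc"), "ascii")
print(result)
```

Because the substitution happens after decoding, width and precision in the format string apply to characters rather than to encoding bytes.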
