Alternative Implementation for PEP 292: Simple String Substitutions

Before it's too late and the API gets frozen, I would like to propose an alternate implementation for PEP 292 that exposes two functions instead of two classes.

Current way:

    print Template('Turn $direction') % dict(direction='right')

Proposed:

    print dollarsub('Turn $direction', dict(direction='right'))

or:

    print dollarsub('Turn $direction', direction='right')

My main issue with the current implementation is that we get no leverage from using a class instead of a function. Though the API is fairly simple either way, it is easier to learn and document functions than classes. We gain nothing from instantiation -- the underlying data remains immutable and no additional state is attached. The only new behavior is the ability to apply the mod operator. Why not do it in one step?

I had thought a possible advantage of classes was that they could be usefully subclassed. However, a little dickering around showed that to do anything remotely interesting (beyond changing the pattern alphabet) you have to rewrite everything by overriding both the method and the pattern. Subclassing gained you nothing, but added a little bit of complexity. A couple of simple exercises show this clearly: write a subclass using a different escape character or one using dotted identifiers for attribute lookup in the local namespace -- either way subclasses provide no help and only get in the way.

One negative effect of the class implementation is that it inherits from unicode and always returns a unicode result even if all of the inputs (mapping values and template) are regular strings. With a function implementation, that can be avoided (returning unicode only if one of the inputs is unicode).

The function approach also makes it possible to have keyword arguments (see the example above) as well as a mapping. This isn't a big win, but it is nice to have and reads well in code that is looping over multiple substitutions (mailmerge style):

    for girl in littleblackbook:
        print dollarsub(loveletter, name=girl[0].title(), favoritesong=girl[3])

Another minor advantage for a function is that it is easier to look up in the reference. If a reader sees the % operator being applied and looks it up in the reference, it is going to steer them in the wrong direction. This is doubly true if the Template instantiation is remote from the operator application.

Summary for functions:

* is more appropriate when there is no state
* no unnecessary instantiation
* can be applied in a single step
* a little easier to learn/use/document
* doesn't force result to unicode
* allows keyword arguments
* easy to find in the docs

Raymond

----------- Sample Implementation -------------

    # _pattern is the compiled placeholder regex; it is defined elsewhere
    # and is not shown in this message.

    def dollarsub(template, mapping=None, **kwds):
        """A function for supporting $-substitutions."""
        typ = type(template)
        if mapping is None:
            mapping = kwds
        def convert(mo):
            escaped, named, braced, bogus = mo.groups()
            if escaped is not None:
                return '$'
            if bogus is not None:
                raise ValueError('Invalid placeholder at index %d'
                                 % mo.start('bogus'))
            val = mapping[named or braced]
            return typ(val)
        return _pattern.sub(convert, template)

    def safedollarsub(template, mapping=None, **kwds):
        """A function for $-substitutions.

        This function is 'safe' in the sense that you will never get
        KeyErrors if there are placeholders missing from the interpolation
        dictionary. In that case, you will get the original placeholder in
        the value string.
        """
        typ = type(template)
        if mapping is None:
            mapping = kwds
        def convert(mo):
            escaped, named, braced, bogus = mo.groups()
            if escaped is not None:
                return '$'
            if bogus is not None:
                raise ValueError('Invalid placeholder at index %d'
                                 % mo.start('bogus'))
            if named is not None:
                try:
                    return typ(mapping[named])
                except KeyError:
                    return '$' + named
            try:
                return typ(mapping[braced])
            except KeyError:
                return '${' + braced + '}'
        return _pattern.sub(convert, template)

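For readers who want to try the function-based proposal, here is a minimal sketch in present-day Python 3. It is not the code from the message above: the regular expression `_pattern` was not shown there, so the one below is an assumption reconstructed from the four group names that `convert` unpacks (escaped, named, braced, bogus), and the unicode/str type juggling is dropped because Python 3 has a single text type.

```python
import re

# Assumed pattern (not shown in the original message). Group names follow
# the tuple unpacked in convert(): escaped, named, braced, bogus.
_pattern = re.compile(r"""
    \$(?:
        (?P<escaped>\$)                          |  # doubled dollar sign
        (?P<named>[_A-Za-z][_A-Za-z0-9]*)        |  # dollar + identifier
        \{(?P<braced>[_A-Za-z][_A-Za-z0-9]*)\}   |  # dollar + {identifier}
        (?P<bogus>)                                 # anything else is an error
    )
""", re.VERBOSE)


def dollarsub(template, mapping=None, **kwds):
    """Substitute placeholders; a missing key raises KeyError."""
    if mapping is None:
        mapping = kwds

    def convert(mo):
        if mo.group('escaped') is not None:
            return '$'
        if mo.group('bogus') is not None:
            raise ValueError('Invalid placeholder at index %d' % mo.start('bogus'))
        return str(mapping[mo.group('named') or mo.group('braced')])

    return _pattern.sub(convert, template)


def safedollarsub(template, mapping=None, **kwds):
    """Like dollarsub, but leave unknown placeholders untouched."""
    if mapping is None:
        mapping = kwds

    def convert(mo):
        if mo.group('escaped') is not None:
            return '$'
        if mo.group('bogus') is not None:
            raise ValueError('Invalid placeholder at index %d' % mo.start('bogus'))
        try:
            return str(mapping[mo.group('named') or mo.group('braced')])
        except KeyError:
            return mo.group(0)  # the original placeholder text

    return _pattern.sub(convert, template)
```

Both calling styles from the proposal work with this sketch: `dollarsub('Turn $direction', dict(direction='right'))` and `dollarsub('Turn $direction', direction='right')`.
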
On Thu, Aug 26, 2004, Raymond Hettinger wrote:
> * doesn't force result to unicode
This is the main reason I'm +0, pending further arguments. OTOH, I also like using %, so you'd have to come up with more points to move me beyond +0. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "To me vi is Zen. To use vi is to practice zen. Every command is a koan. Profound to the user, unintelligible to the uninitiated. You discover truth everytime you use it." --reddy@lion.austin.ibm.com

On Thu, 2004-08-26 at 17:38, Raymond Hettinger wrote:
Weren't you the one who gave the Cheetah example? What was interesting about that was that the instance's attributes formed the substitution namespace. That's a use case I instantly liked. So there you have state attached to an instance. Another case for that would be in i18n applications where you might want to attach information such as the gettext domain to the instance. You might also want to build up the namespace in several locations, and delay performing the substitution until the last possible moment. In all those cases you have state attached to an instance (and would immediately invent such an instance for those use cases if you didn't have one).
To me that's not a disadvantage. For i18n applications, unicode is the only reasonable thing for human readable text. str's are only useful for byte arrays <wink>. It's not a disadvantage for Jython or IronPython either. :)
The mod operator was chosen because that's what people are familiar with, but it would probably be okay to pick a different method name. I think Guido's suggested using __call__() -- which I want to think more about.
> This is doubly true if the Template instantiation is remote from the operator application.
Which, in some use cases, it most definitely will be. -Barry

[Raymond]
[Barry]
> To me that's not a disadvantage.
By not inheriting from unicode, the bug can be fixed while retaining a class implementation (see sandbox\curry292.py for an example). But, be clear, it *is* a bug. If all the inputs are strings, Unicode should not magically appear. See all the other string methods as an example. Someday, all will be Unicode, until then, some apps choose to remain Unicode free. Also, there is a build option to not even compile Unicode support -- it would be a bummer to have the $ templates fail as a result. Raymond P.S. Here's the doctest from the sandbox code. What is at issue is the result of the first test:

On Mon, 2004-08-30 at 01:48, Raymond Hettinger wrote:
But the Template classes aren't string methods, so I don't think the analogy is quite right. Because the template string itself is by definition a Unicode, it actually makes more sense that everything its mod operator returns is also a Unicode. So I still don't think it's a bug.
Maybe. Like the doctor says, well, don't do that! (i.e. use Templates and disable unicode). -Barry

Templates are not Unicode by definition. That is an arbitrary implementation quirk and a design flaw. The '%(key)s' forms do not behave this way. They return str unless one of the inputs is unicode. People should be able to use Python and not have to deal with Unicode unless that is an intentional part of their design. Unless there is some compelling advantage to going beyond the PEP and changing all the rules, it is a bug. Raymond

Raymond Hettinger wrote:
I think Barry needs some backup here. First, please be aware that normal use of Templates is for formatting *text* data. Second, it is good design and good practice to store text data in Unicode objects, because that's what they were designed for, while string objects have always been an abstract container for storing bytes with varying meanings and interpretations. The latter is a design flaw that needs to get fixed, not the choice of Unicode as Template base class. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 03 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

[MAL]
IMO, it is subversive to start taking new string functions/methods and coercing their results to Unicode. Someday we may be there, Py3.0 perhaps, but str is not yet deprecated. Until then, a user should reasonably expect SISO: str in, str out. This is doubly true when the rest of python makes active efforts to avoid SIUO (see % formatting and ''.join() for example). Someday Guido may get wild and turn all text uses of str into unicode. Most likely, it will need a PEP so that all the issues get thought through and everything gets changed at once. Slipping this into the third alpha as if it were part of PEP 292 is not a good idea. The PEP was about simplification. Tossing in unnecessary unicode coercions is not in line with that goal. Does anyone else think this is a crummy idea? Is everyone ready for unicode coercions to start sprouting everywhere? Raymond

On Fri, Sep 03, 2004, Raymond Hettinger wrote:
+0 (agreeing with Raymond)

Correct me if I'm wrong, but there are a couple of issues here:

* First of all, I believe that unicode strings are interoperable (down to hashing) with 8-bit strings, as long as there are no non-7-bit ASCII characters. Where things get icky is with encoded 8-bit strings making use of e.g. Latin-1. So the question is whether we need full interoperability.
* Unicode strings take four bytes per character (not counting decomposed characters). Is it fair at this point in Python's evolution to force this kind of change in performance metric, essentially silently? The PEP and docs do make the issue of Unicode fairly clear up-front, so anyone choosing to use template strings knows what zie is getting into. But what about someone grabbing a module that uses template strings internally?....

OTOH, I'm not up for making a big issue out of this. If Raymond really is the only person who feels strongly about it, it probably isn't going to be a big deal in practice. In addition, I think it's the kind of change that could be easily fixed in the next release.

-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes

Raymond Hettinger wrote:
Yes. Whatever MAL and Barry think, Python's current model is 8+8=8, U+U=U, and 8+U=U for ascii U. That's an advantage, not a bug.
> Is everyone ready for unicode coercions to start sprouting everywhere?
No. And when that time comes, storing everything as 32-bit characters is not the right answer either. </F>

Fredrik Lundh wrote:
Indeed, but I don't see how that's different from what the PEP is saying.
I'll leave that for the libc designers to decide :-) If you look at performance, there's not much difference between 8-bit strings and Unicode, so the only argument against using Unicode for storing text data is memory usage. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 04 2004)

M.-A. Lemburg wrote:
the current implementation is T(8) % 8 = U. which violates the 8+8=8 rule.
I used to make that argument, but these days, I no longer think that you can talk about performance without taking memory usage into account. </F>

Fredrik Lundh wrote:
T is a sub-class of Unicode, so you have: U % 8 = U which is just fine.
You always have to take both into account. I was just saying that 8-bit strings don't buy you much in terms of performance over Unicode these days, so the only argument against using Unicode would be doubled memory usage. Of course, this is a rather mild argument given the problems you face when trying to localize applications - which I see as the main use case for templates. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 05 2004)

On Sun, Sep 05, 2004, M.-A. Lemburg wrote:
Only if one sticks with the 2-byte Unicode implementation; you can compile Python with 4-byte Unicode (and I seem to recall that at least one standard distribution does exactly that).
If I18N is intended to be the primary/only use case of templates, then the PEP needs to be updated. It would also explain some of the disagreement about the implementation. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes

Raymond Hettinger wrote:
Hmm, I wonder why you cut away the first part: "First, please be aware that normal use of Templates is for formatting *text* data." This is the most important argument for making Template a Unicode-subclass. Coercion to Unicode then is a logical consequence and fully in line with what Python has been doing since version 1.6, i.e. U=U+U and U=U+8 (to use /F's notation).
> IMO, it is subversive to start taking new string functions/methods and coercing their results to Unicode.
I don't understand... there's nothing subversive here. If strings meet Unicode the result gets coerced to Unicode. Nothing surprising here. Why are you guys putting so much effort into fighting Unicode ? I often get the impression that you are considering Unicode a nightmare rather than a blessing. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 04 2004)

On Sat, 2004-09-04 at 07:51, M.-A. Lemburg wrote:
Indeed. For example, the only way to maintain your sanity in an i18n'd application is to convert all text[1] to unicode as early as possible, deal with only unicode internally, and encode to 8bit strings as late as possible, if ever. -Barry [1] "text" defined as "strings intended for human consumption".

On Fri, 2004-09-03 at 16:37, M.-A. Lemburg wrote:
> I think Barry needs some backup here.
Thanks MAL! I'll point out that Template was very deliberately subclassed from unicode, so Template instances /are/ unicode objects. From the standpoint of type conversion, using /F's notation, T(8) == U, thus because U % 8 == U, T(8) % 8 == U. Other than .encode() are there any other methods of unicode objects that return 8bit strings? I don't think so, so it seems completely natural that T % 8 returns U. Raymond is against the class-based implementation of PEP 292, but if you accept the class implementation of 292 (which I still believe is the right choice), then the fact that the mod operator always returns a unicode makes perfect sense. -Barry

That misses the point. Templates do not have to be unicode objects. Templates can be their own class rather than a subclass of unicode. The application does not demand that unicode be mentioned at all. There seems to be a strong "just live with it" argument but no advantages are offered other than it matching your personal approach to text handling. Why force it when you don't have to? At least three of your users (me, Aahz, and Fred) do not want unicode output when we have str inputs.
> Raymond is against the class-based implementation of PEP 292,
That choice is independent of the decision of whether to always coerce to unicode. Also, it is more accurate to say that I think the __mod__ operator is not ideal. If you want to stay with classes, Guido's __call__ syntax is also fine. It avoids the issues with %, makes it possible to have keyword arguments, and lets you take advantage of polymorphism.

The % operator has several issues:

* it is mnemonic for %(name)s substitution not $ formatting.
* it is hard to find in the docs
* it does not accept tuple/scalar arguments like % formatting
* its precedence is more appropriate for int.__mod__

Raymond

Raymond Hettinger wrote:
one of which wrote the original unicode implementation, and the mixed-type regular expression engine used to implement templates, and a very popular XML library that successfully uses mixed-type text to handle text faster and using less memory than all other Python XML libraries. I've shown over and over again that Unicode-aware text handling in Python doesn't have to be slow and bloated; I'd prefer if we kept it that way. </F>

On Sat, 2004-09-04 at 16:03, Raymond Hettinger wrote:
But it's damn convenient for them to be though. Please read the Internationalization section of the PEP. In addition to being able to use them directly as gettext catalog keys, I think there will be /a lot/ of scenarios where you won't want to care whether you have a Template or a unicode -- you will just want to treat everything as a unicode string without having to do tedious type checking.
<deep_breath> PEP 292 was a direct outgrowth of my experience in trying to internationalize an application and make it (much) easier for my translators to contribute. Many of them are not Python gurus and the existing % syntax is clearly a common tripping point. I'm convinced that the current design of PEP 292 is right for the use cases I originally designed it for. To be generous, if the three of you disagree, then it's because you have other requirements. That's fine; maybe they're just incompatible with mine. Maybe I did a poor job of explaining how my use cases lead to the design of PEP 292. If all that's true, then PEP 292 can't be made general enough and should be rejected, and the code should be ripped out of the standard library. Let applications use whatever is appropriate for their own use cases. Because PEP 292 is a library addition, Python itself won't suffer in the least. The implementations you proposed won't be of any use to me. Fortunately, the archives will be replete with all the alternatives for future software archaeologists. -Barry

Barry wrote:
from a user perspective, there's no reason to make templates a subclass of unicode, so the rest of your argument is irrelevant. instead of looking at use patterns, you're stuck defending the existing code. that's not a good way to design usable code. </F>

On Sun, 2004-09-05 at 04:26, Fredrik Lundh wrote:
> from a user perspective, there's no reason to make templates a subclass of unicode, so the rest of your argument is irrelevant.
Not true. I had a very specific reason for making Templates subclasses of unicode. Read the Internationalization section of the PEP. -Barry

Barry wrote:
> Not true. I had a very specific reason for making Templates subclasses of unicode. Read the Internationalization section of the PEP.
this section?

    The implementation supports internationalization magic by keeping the original string value intact. In fact, all the work of the special substitution rules are implemented by overriding the __mod__() operator. However the string value of a Template (or SafeTemplate) is the string that was passed to its constructor. This approach allows a gettext-based internationalized program to use the Template instance as a lookup into the catalog; in fact gettext doesn't care that the catalog key is a Template. Because the value of the Template is the original $-string, translators also never need to use %-strings. The right thing will happen at run-time.

I don't follow: if you're passing a template to gettext, do you really get a template back?

if that's really the case, is being able to write "_(Template(x))" really that much of an advantage over writing "Template(_(x))" ?

if that's really the case, you can still get the same effect from a template factory function, or a trivial modification of gettext.

</F>
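The "translate first, wrap second" ordering and the template factory that Fredrik mentions can be sketched with a toy catalog. Everything named here is an illustration: `CATALOG`, `_`, and `translated_template` are hypothetical stand-ins for a real gettext lookup, and `string.Template` is the class as it exists in today's standard library, not the unicode subclass under discussion in this thread.

```python
from string import Template

# Hypothetical stand-in for a gettext catalog: source $-string -> translation.
CATALOG = {'Turn $direction': 'Gire a la $direction'}


def _(message):
    """Toy lookup: return the translation, or the message itself if missing."""
    return CATALOG.get(message, message)


def translated_template(message):
    """Template factory: translate the plain $-string, then wrap the result."""
    return Template(_(message))


t = translated_template('Turn $direction')
result = t.substitute(direction='derecha')
```

With this ordering the catalog only ever sees plain strings, so the lookup does not depend on whether a template is itself a string subclass.
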

Fredrik Lundh wrote:
Templates are meant to template *text* data, so Unicode is the right choice of baseclass from a design perspective.
> instead of looking at use patterns, you're stuck defending the existing code. that's not a good way to design usable code.
Perhaps I'm missing something, but where would you use Templates for templating binary data (where strings or bytes would be a more appropriate design choice) ? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

> Templates are meant to template *text* data, so Unicode is the right choice of baseclass from a design perspective.
Only in Python 3.0. But even so, deriving from Unicode (or str) means the template class inherits a lot of unwanted operations. While I can see that concatenating templates probably works, slicing them or converting to lowercase etc. make no sense. IMO the standard Template class should implement a "narrow" interface, i.e. *only* the template expansion method (__mod__ or something else), so it's clear that other compatible template classes shouldn't have to implement anything besides that. This avoids the issues we have with the mapping protocol: when does an object implement enough of the mapping API to be usable? That's currently ill-defined; sometimes, __getitem__ is all you need, sometimes __contains__ is required, sometimes keys, rarely setdefault. -- --Guido van Rossum (home page: http://www.python.org/~guido/) Ask me about gmail.
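A "narrow" template interface of the kind Guido describes can be illustrated with duck typing: the caller relies on a single expansion method and nothing else. In this sketch the method name `substitute` is borrowed from the name adopted later in this thread, and `PercentTemplate` and `render` are hypothetical names made up for the example.

```python
from string import Template


class PercentTemplate:
    """Hypothetical template class that implements only the expansion method.

    It is not a string subclass: no slicing, no lower(), no concatenation.
    """

    def __init__(self, text):
        self.text = text

    def substitute(self, mapping):
        # Delegate to %(name)s-style formatting.
        return self.text % mapping


def render(template, mapping):
    """Works with any object that provides substitute(mapping)."""
    return template.substitute(mapping)


a = render(Template('Hello $name'), {'name': 'Guido'})
b = render(PercentTemplate('Hello %(name)s'), {'name': 'Guido'})
```

Because `render` asks for one method only, a compatible template class has an unambiguous contract to meet, which is the contrast Guido draws with the loosely defined mapping protocol.
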

Guido van Rossum wrote:
We better start early to ever reach the point of making a clear distinction between text and binary data in P3k.
Looks like it's not even clear what templating itself should mean... you're talking about a templating interface here, not an object type, like Barry is (for the sake of making Templates compatible to i18n tools like gettext). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

M.-A. Lemburg wrote
> for the sake of making Templates compatible to i18n tools like gettext).
assuming that gettext really always returns a template if you hand it a template, of course. given that the 2.4 gettext doesn't seem to map templates to templates on my machine, that there's no sign of template support in the 2.4 gettext source code, and that Barry ignored my question about this, I have to assume that the I18N argument is yet another bogus argument. </F>

The introduction of a bytes type in Python 2.5 should be a good start.
I don't know zip about i18n or gettext. But I thought we had plenty of time since Barry has offered to withdraw the PEP 292 implementation for 2.4? -- --Guido van Rossum (home page: http://www.python.org/~guido/) Ask me about gmail.

On Wed, 2004-09-08 at 22:29, Guido van Rossum wrote:
> But I thought we had plenty of time since Barry has offered to withdraw the PEP 292 implementation for 2.4?
Which I will still do if we cannot reach community agreement by beta1. But let's see how the latest proposal goes over. -Barry

On Wed, 2004-09-08 at 11:08, Guido van Rossum wrote:
Except that I think in general it'll just be very convenient for Templates to /be/ unicodes. But no matter. It seems like if we make Template a simple class, it will be possible for applications to mix in Template and unicode if they want. E.g. class UTemplate(Template, unicode).

If we go that route, then I agree we probably don't want to use __mod__(), but I'm not too crazy about using __call__(). "Calling a template" just seems weird to me. Besides, extrapolating, I don't think we need separate Template and SafeTemplate classes. A single Template class can have both safe and non-safe substitution methods.

So, I have working code that integrates these changes, and also uses Tim's metaclass idea to provide a nice, easy-to-document pattern overloading mechanism. I chose methods substitute() and safe_substitute() because, er, that's what they do, and those names also don't interfere with existing str or unicode methods. And to make effbot and Raymond happy, it won't auto-promote to unicode if everything's an 8bit string.

I will check this in and hopefully this will put the issue to bed. There will be updated unit tests, and I will update the documentation and the PEP as appropriate -- if we've reached agreement on it. -Barry
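The single-class design with two methods is what the standard library's `string.Template` provides today, so the difference between the two methods can be shown directly. This is a small usage sketch in Python 3; the template text and the variable names are made up for illustration.

```python
from string import Template

t = Template('$who owes $amount')

# All placeholders supplied: both methods give the same result.
complete = t.substitute({'who': 'tim', 'amount': 'ten dollars'})

# substitute() is strict: a missing key raises KeyError.
try:
    t.substitute({'who': 'tim'})
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# safe_substitute() leaves the unknown placeholder in the output instead.
partial = t.safe_substitute({'who': 'tim'})
```

Here `complete` is the fully expanded string, `missing_key` names the placeholder that had no value, and `partial` still contains the literal `$amount`.
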

[Barry]
> And to make effbot and Raymond happy, it won't auto-promote to unicode if everything's an 8bit string.
Glad to see that my happiness now ranks as a development objective ;-)
> There will be updated unit tests, and I will update the documentation and the PEP as appropriate -- if we've reached agreement on it.
+1 Beautiful job.

Barry asked me to bring up one remaining implementation issue for discussion on python-dev. The docs clearly state that only python identifiers are allowed as placeholders:

    [_A-Za-z][_A-Za-z0-9]*

The challenge is that templates can be exposed to non-programmer end-users with no reason to suspect that one letter of their alphabet is different from another. So, as it stands right now, there is a usability issue with placeholder errors passing silently:

    >>> fechas = {u'hoy':u'lunes', u'mañana':u'martes'}
    >>> t = Template(u'¿Puede volver $hoy o $mañana?')
    >>> t.safe_substitute(fechas)
    u'¿Puede volver lunes o $mañana?'

The substitution failed silently (no ValueError as would have occurred with $@ or a dangling $). It may be especially baffling for the user because one placeholder succeeded and the other failed without a hint of why (he can see the key in the mapping, it just won't substitute). No clue is offered that the Template was looking for $ma, a partial token, and didn't find it (the situation is even worse if it does find $ma and substitutes an unintended value).

I suggest that the above should raise an error:

    ValueError: Invalid token $mañana on line 1, column 24

It is easily possible to detect and report such errors (see an example in nondist/sandbox/string/curry292.py).

The arguments against such reporting are:

* Raymond is smoking crack. End users will never make this mistake.
* The docs say python identifiers only. You blew it. Tough. Not a bug.
* For someone who understands exactly what they are doing, perhaps $ma is the intended placeholder -- why force them to use braces: ${ma}ñana.

In addition to the above usability issue, there is one other nit. The new invocation syntax offers us the opportunity to also accept keyword arguments as mapping alternatives:

    def substitute(self, mapping=None, **kwds):
        if mapping is None:
            mapping = kwds
        . . .

When applicable, this makes for beautiful, readable calls:

    t.substitute(who="Barry", what="mailmeister", when=now())

This would be a simple and nice enhancement to Barry's excellent implementation. I recommend that keyword arguments be adopted. Raymond
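The partial-token behaviour Raymond describes can still be reproduced with the standard library's `string.Template`, whose default placeholder names are limited to ASCII identifiers. The following Python 3 sketch reuses the example from the message; the extra key `ma` and its value are made up to show the "unintended value" case.

```python
from string import Template

fechas = {'hoy': 'lunes', 'mañana': 'martes'}
t = Template('¿Puede volver $hoy o $mañana?')

# The placeholder is read as $ma followed by the literal text "ñana".
# safe_substitute() finds no 'ma' key and leaves the text untouched.
silent = t.safe_substitute(fechas)

# If a key named 'ma' happens to exist, it is substituted instead.
unintended = t.safe_substitute(fechas, ma='MA')

# The strict method at least reports which key it was really looking for.
try:
    t.substitute(fechas)
    looked_for = None
except KeyError as exc:
    looked_for = exc.args[0]

# Keyword arguments, as recommended at the end of the message.
greeting = Template('$who is the $what').substitute(who='Barry', what='mailmeister')
```

In this run `silent` still contains `$mañana`, `unintended` contains `MAñana`, and `looked_for` is the partial token `ma`.
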

On Fri, 2004-09-10 at 01:50, Raymond Hettinger wrote:
Well, if I want to get other work done... :)
Cool!
It also makes it more difficult to document. IOW, right now the PEP and the documentation say that the first non-identifier character terminates the placeholder. How would you word the rules with your change?
My only problem with that is the interference that the 'mapping' argument presents. IOW, kwds can't contain 'mapping'. We could solve that in a couple of ways:

1. ignore the problem and tell people not to do that
2. change 'mapping' to something less likely to collide, such as '_mapping' or '__mapping__', and then see #1.
3. get rid of the mapping altogether and only have kwds. This would change the non-keyword invocation from mytemplate.substitute(mymapping) to mytemplate.substitute(**mymapping)

A bit uglier and harder to document. Note that there's also a potential collision on 'self'. -Barry

"""Placeholders must be a valid Python identifier (containing only ASCII alphanumeric characters and an underscore). If an unbraced identifier ends with a non-ASCII alphanumeric character, such as the latin letter n with tilde in $mañana, then a ValueError is raised for the specious identifier.
> My only problem with that is the interference that the 'mapping' argument presents. IOW, kwds can't contain 'mapping'.
To support a case where both a mapping and keywords are present, perhaps an auxiliary class could simplify matters:

    def substitute(self, mapping=None, **kwds):
        if mapping is None:
            mapping = kwds
        elif kwds:
            mapping = _altmap(kwds, mapping)
        . . .

    class _altmap:
        def __init__(self, primary, secondary):
            self.primary = primary
            self.secondary = secondary
        def __getitem__(self, key):
            try:
                return self.primary[key]
            except KeyError:
                return self.secondary[key]

This matches the way keywords are used with the dict(). Raymond
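The lookup order of the auxiliary class (keywords first, then the mapping) can be checked with a short runnable sketch. The class body repeats the `_altmap` idea from the message; the sample data is made up, and `collections.ChainMap` is shown only as a later standard-library tool with the same lookup order, not as something available when this thread was written.

```python
from collections import ChainMap


class _altmap:
    """Read-only view: look in primary first, fall back to secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def __getitem__(self, key):
        try:
            return self.primary[key]
        except KeyError:
            return self.secondary[key]


mapping = {'who': 'Barry', 'what': 'mailmeister'}
kwds = {'what': 'release manager'}

combined = _altmap(kwds, mapping)   # keywords take precedence
chained = ChainMap(kwds, mapping)   # same precedence with ChainMap
```

Both objects return the keyword value for `'what'` and fall back to the mapping for `'who'`, and neither copies the underlying dictionaries.
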

Raymond Hettinger wrote:
I don't think any of this is needed. If a non-programmer is being told to use string substitution chances are someone is either going to explain it to them or there will be another set of docs to explain things in a simple way. I suspect stating exactly what a valid Python identifier contains as you did in parentheses above will be enough. -Brett

Brett C. wrote:
Also, since Barry has gone to great lengths to make Template overrideable, applications can replace the regular expression in their derived Template class when there is a need to allow for end-users inputting template strings. So, I'd suggest keeping safe_substitute relatively simple, but document the limitation and/or solution. Thanks, -Shane Holloway

[Brett]
> I suspect stating exactly what a valid Python identifier contains as you did in parentheses above will be enough.
Given the template, u'¿Puede volver $hoy o $mañana?', you think $ma is an intended placeholder name and that ñ should be a delimiter just like whitespace and punctuation? If end users always follow the rules, this will never come up. If they don't, should there be an error message or a silent failure? Raymond

Raymond Hettinger wrote:
No, I think Brett (and apparently nearly everybody else) thinks that such a template will not be written over the course of the next five years, except for demonstration purposes. Instead, what will be written is u'¿Puede volver $today o $tomorrow?' because the template will be a translation of the original English template, and, during translation, placeholder names must not be changed (although I have difficulties imagining possible values for today or tomorrow so that this becomes meaningful).
> If end users always follow the rules, this will never come up. If they don't, should there be error message or a silent failure?
There is always a chance of a silent failure in SafeTemplates, even with this rule added - this is the purpose of SafeTemplates. With a Template, you will get a KeyError. In any case, the failure will not be completely silent, as the user will see $mañana show up in the output. My prediction is that the typical application is to use Templates, as users know very well what the placeholders are. Furthermore, the typical application will use locals/globals/vars(), or dict(key="value") to create the replacement dictionary. In this application, nobody would even think of using mañana as a key, because you can't get it into the dictionary. If this never comes up, it is better to not complicate the rules. Simple is better than complex. Regards, Martin

Martin v. Löwis wrote:
Actually, that wasn't what I was thinking, but that also works. My original thinking is that Template will throw a fit and that's fine since they didn't follow the rules.
Right, my other reason for not thinking this is a big issue. If you use SafeTemplate you will have to watch out for silent problems like this anyway. I just don't think it will be a big problem. And if people want the support they will just use a pure Unicode Template subclass (perhaps we should include that in the module?). -Brett

On Sat, 2004-09-11 at 04:39, "Martin v. Löwis" wrote:
I tend to agree, so I'd like to keep the rules as they currently stand. Your prediction is aligned with what I think the most common use cases are too. -Barry

Raymond Hettinger wrote:
so why keep the python identifier limitation? the RE engine you're using to parse the template has a concept of "alphanumeric character". just define the placeholder syntax as "one or more alphanumeric characters or underscores" (\w+), use re.UNICODE if the template is created from a unicode string, and you're done. this doesn't mean that people *have* to use non-ASCII characters, of course. but if they do, things just work. </F>
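Fredrik's suggestion, and Shane's earlier point that a derived class can replace the pattern, can both be tried against the standard library's `string.Template` by overriding its `idpattern` class attribute. This is a Python 3 sketch: `WordTemplate` is a hypothetical name, and no `re.UNICODE` flag is needed here because `\w` already matches non-ASCII letters for Python 3 text patterns.

```python
from string import Template


class WordTemplate(Template):
    """Hypothetical subclass: any run of word characters is a placeholder name."""
    idpattern = r'\w+'


fechas = {'hoy': 'lunes', 'mañana': 'martes'}

# The default class stops the name at the first non-ASCII letter.
default_result = Template('¿Puede volver $hoy o $mañana?').safe_substitute(fechas)

# The subclass reads the whole word as the placeholder name.
word_result = WordTemplate('¿Puede volver $hoy o $mañana?').substitute(fechas)
```

One side effect of a bare `\w+` is that names may begin with a digit, so an application that wants identifier-like names would need a slightly stricter pattern.
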

On Fri, 2004-09-10 at 18:22, Raymond Hettinger wrote:
This matches the way keywords are used with the dict().
This isn't exactly what I was concerned about, but I agree that it's a worthwhile approach. (I'm going to accept your patch and check it in, with slight modifications.) What I was worried about is that if you provide 'mapping' positionally and kwds contains a 'mapping' key, you'll get a TypeError. I'm going to change the positional argument to '__mapping' so collisions of that kind are less likely, and will document it in libstring.tex. -Barry
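The collision can be reproduced with two stand-in functions. Both are hypothetical and only return the namespace they would substitute from; they are not the code that was checked in.

```python
def sub_v1(template, mapping=None, **kwds):
    """Positional parameter named 'mapping': clashes with a 'mapping' keyword."""
    return mapping if mapping is not None else kwds

def sub_v2(template, __mapping=None, **kwds):
    """Renamed positional parameter, leaving 'mapping' free as a keyword."""
    return __mapping if __mapping is not None else kwds

# Passing the dictionary positionally and also using mapping= as a keyword
# binds the same parameter twice, which Python rejects.
try:
    sub_v1("$mapping", {"mapping": "a"}, mapping="b")
    collided = False
except TypeError:
    collided = True

resolved = sub_v2("$mapping", {"x": 1}, mapping="b")
keyword_only = sub_v2("$mapping", mapping="b")
print(collided, resolved, keyword_only)
```

The rename does not remove the problem entirely: a keyword literally called `__mapping` would still clash, which is why the change only makes collisions less likely.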

M.-A. Lemburg wrote:
not true. as I've shown in SRE and ElementTree (just to give a few examples), 8-bit strings are superior for the *huge* subset of all text strings that only contain ASCII data.
8-bit strings != binary data. you clearly haven't read my other posts in this thread. please do that, instead of repeating the same bogus arguments over again. </F>

Fredrik Lundh wrote:
I've read them all and, to be honest, I don't follow your argumentation. The text interpretation of 8-bit strings is only one possible form of their interpretation. You could just as well have image data in your 8-bit string and calling .lower() on such a string is certainly going to render that image data useless. The whole point in adding Unicode to the language was to make the difference between text and binary data clear and visible at the type level. I'm not saying that you cannot store text data in 8-bit strings, but that we should start to make use of the distinction between text and binary data. If we start to store text data in Unicode now and leave binary data in 8-bit strings, then the move to Unicode string literals will be much smoother in P3k. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

M.-A. Lemburg wrote:
well, when I wrote the Unicode type, the whole point was to be able to make it easy to handle Unicode text. no more, no less.
hopefully, the P3K string design will take a lot more into account than text-vs-binary; there are many ways to represent text, and many ways to store binary data, and many usage patterns for them both. a good design should take most of this into account. (google for "stringlib" for some work I'm doing in this area) </F>

Fredrik Lundh wrote:
... and the Unicode integration made that a reality :-) In today's globalized world, the only sane way to deal with different scripts is through Unicode, which is why I believe that text data should eventually always be stored in Unicode objects - regardless of whether it takes more memory or not. (If you compare development time to prices of a few GB extra RAM, the effort needed to maintain text in non-Unicode formats simply doesn't pay off anymore.)
Ah, now I know where you're coming from :-) Shift tables don't work well in the Unicode world with its large alphabet. BTW, you might want to look at the BMS implementation I did for mxTextTools. Here's a nice reference for pattern matching: http://www-igm.univ-mlv.fr/~lecroq/string/index.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

Marc-Andre Lemburg wrote:
This is not as obvious as it seems, because the "few GB extra RAM" is a price paid by everyone who *uses* the software. Granted, it's quite common for software to be only run ever on one or two machines in the company where it was developed, but not all software is used that way. Also: the price of "a few GB extra RAM" is not always as low as it seems. If adding 2GB means moving from 3GB to 5GB, it may mean replacing the CPU and the OS. That said, I strongly agree that all textual data should be Unicode as far as the developer is concerned; but, at least in the USA :-), it makes sense to have an optimized representation that saves space for ASCII-only text, just as we have an optimized representation for small integers. (The benefit is potentially much greater in that case, though.) -- g

"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> That said, I strongly agree that all textual data should Gareth> be Unicode as far as the developer is concerned; but, at Gareth> least in the USA :-), it makes sense to have an optimized Gareth> representation that saves space for ASCII-only text, just Gareth> as we have an optimized representation for small integers. This is _not at all_ obvious. As MAL just pointed out, if efficiency is a goal, text algorithms often need to be different for operations on texts that are dense in an 8-bit character space, vs texts that are sparse in a 16-bit or 20-bit character space. Note that that is what </F> is talking about too; he points to SRE and ElementTree. When viewed from that point of view, the subtext to </F>'s comment is "I don't want to separately maintain 8-bit versions of new text facilities to support my non-Unicode applications, I want to impose that burden on the authors of text-handling PEPs." That may very well be the best thing for Python; as </F> has done a lot of Unicode implementation for Python, he's in a good position to make such judgements. But the development costs MAL refers to are bigger than you are estimating, and will continue as long as that policy does. While I'm very sympathetic to </F>'s view that there's more than one way to skin a cat, and a good cat-handling design should account for that, and conceding his expertise, none-the-less I don't think that Python really wants to _maintain_ more than one text-processing system by default. Of course if you restrict yourself to the class of ASCII-only strings, you can do better, and of course that is a huge class of strings. But that, as such, is important only to efficiency fanatics. The question is, how often are people going to notice that when they have pure ASCII they get a 100% speedup, or that they actually can just suck that 3GB ASCII file into their 4GB memory, rather than buffering it as 3 (or 6) 2GB Unicode strings? 
Compare how often people are going to notice that a new facility "just works" for Japanese or Hindi. I just don't see the former being worth the extra effort, while the latter makes the "this or that" choice clear. If a single representation is enough, it had better be Unicode-based, and the others can be supported in libraries (which turn binary blobs into non-standard text objects with appropriate methods) as the need arises. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
I hope you aren't expecting me to disagree.
How do you know what I am estimating?
No, it's important to ... well, people to whom efficiency matters. There's no need for them to be fanatics.
Why is that the question, rather than "how often are people going to benefit from getting a 100% speedup when they have pure ASCII"? Or even "how often are people going to try out Python on an application that uses pure-ASCII strings, and decide to use some other language that seems to do the job much faster"?
No question that if a single representation is enough then it had better be Unicode. -- g

"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote: >> But [efficiency], as such, is important only to efficiency >> fanatics. Gareth> No, it's important to ... well, people to whom efficiency Gareth> matters. There's no need for them to be fanatics. If it matters just because they care, they're fanatics. If it matters because they get some other benefit (response time less than the threshold of notice, twice as many searches per unit time, half as many boxes to serve a given load), they're not. </F>'s talk of many ways to do things "and Python should account for most of them" strikes me as fanaticism by that definition; the vast majority of developers will never deal with the special cases, or write apps that anticipate dealing with huge ASCII strings. Those costs should be borne by the developers who do, and their clients. I apologize for shoehorning that into my reply to you. >> The question is, how often are people going to notice that when >> they have pure ASCII they get a 100% speedup [...]? Gareth> Why is that the question, rather than "how often are Gareth> people going to benefit from getting a 100% speedup when Gareth> they have pure ASCII"? Because "benefit" is very subjective for _one_ person, and I don't want to even think about putting coefficients on your benefit versus mine. If the benefit is large enough, a single person will be willing to do the extra work. The question is, should all Python users and developers bear some burden to make it easier for that person to do what he needs to do? I think "notice" is something you can get consensus on. If a lot of people are _noticing_ the difference, I think that's a reasonable rule of thumb for when we might want to put "it", or facilities for making individual efforts to deal with "it" simpler, into "standard Python" at some level. If only a few people are noticing, let them become expert at dealing with it. 
Gareth> Or even "how often are people going to try out Python on Gareth> an application that uses pure-ASCII strings, and decide to Gareth> use some other language that seems to do the job much Gareth> faster"? See? You're now using a "notice" standard, too. I don't think that's an accident. >> I just don't see the former being worth the extra effort, while >> the latter makes the "this or that" choice clear. If a single >> representation is enough, it had better be Unicode-based, and >> the others can be supported in libraries (which turn binary >> blobs into non-standard text objects with appropriate methods) >> as the need arises. Gareth> No question that if a single representation is enough then Gareth> it had better be Unicode. Not for you, not for me, not for </F>, I'm pretty sure. The point here is that there is a reasonable way to support the others, too, but their users will have to make more effort than if it were a goal to support them in the "standard language and libraries." I think that's the way to go, and </F> thinks the opposite AFAICT. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba

On Saturday 2004-09-11 08:35, Stephen J. Turnbull wrote:
I am unconvinced that "the vast majority of developers" will not have work to do that involves a large volume of ASCII data ... but I'm not sure this is something either of us is in a position to know. (If it turns out that you're just completing a PhD thesis entitled "Use of large-volume string data among software developers", or something, then please accept my apologies for guessing wrong and enlighten me!)
I apologize for shoehorning that into my reply to you.
That's OK.
"Burden" is just as subjective as "benefit". But let's take a look at these burdens and benefits.
- Burden for a very small number of Python developers: having to write and maintain a larger body of code, with duplication (at least of purpose) between Unicode and ASCII strings.
- Consequent burden on all Python users: more risk of those developers getting burned out and giving up, less time for them to work on other aspects of Python, more danger of bugs in code, larger executables. They won't notice this, of course.
+ Benefit for a small (but not nearly so small) number of Python users: important code runs twice as fast, and this makes a real difference to them.
+ Consequent benefit for all Python users: more use of Python means more people contributing code, bug reports, useful libraries, etc. They won't notice this, either.
+ Benefit for all Python users: some of their code runs a little faster. They won't notice this, either.
Perhaps I'm being obtuse, but it's far from clear to me that this is a net loss for Python users at large. In any case, the burdens seem less likely to be noticed than the benefits.
But even if "noticing the difference" is the key point, it is a mistake (I think) to make it specifically "noticing that when they have pure ASCII they get a 100% speedup". Hence my comment quoted below:
It isn't. It's because I was replying to someone who apparently took "notice" standards as the only relevant ones, in order to point out that even with that assumption there are relevant questions other than "will anyone notice getting a speedup when their data are pure ASCII?". And I, in turn, apologize for shoehorning all *that* into the word "even". :-) I still think, though, that a "notice" standard makes for bad designs. Most people would not notice if all floating-point operations gave results with the last couple of bits wrong, but it is a good thing that they don't. Some people wouldn't notice but would get badly unsatisfactory results. Some people would notice but would find it impractical to work around the problems because that would mean tons of code and major losses in speed. Most people would not notice if by inserting the magic word "wibble" at the start of their programs they could make them 10 times faster, but if for some weird reason it were possible to make that so (but not possible to provide the speedup for programs without "wibble") then it should be done. What people notice is easier to define and to measure than what actually makes a difference to them. That is not enough reason to treat it as the only criterion. -- g

"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> I am unconvinced that "the vast majority of developers" Gareth> will not have work to do that involves a large volume of Gareth> ASCII data ... but I'm not sure this is something either Gareth> of us is in a position to know. Oh, I'm pretty sure that an awful lot of developers _will_ have work to do that involves large volumes of ASCII data. The question is how much will that work be facilitated by having all (as opposed to a few well-chosen) text processing features support returning 8-bit strings as well as Unicodes? Gareth> Perhaps I'm being obtuse, but it's far from clear to me Gareth> that this is a net loss for Python users at large. It's not clear to me, either. I am just not convinced by hand-waving that says "there's no difference between human text processing and other text processing, so any text processing facility should be available in an 8-bit version." Maybe that's a straw man, but that's what </F> was advocating AFAICT. Gareth> I still think, though, that a "notice" standard makes for Gareth> bad designs. We're not talking about design here, IMO. We're talking about requirements. Of course if you're going to implement a capability, you should design it "right." Gareth> What people notice is easier to define and to measure than Gareth> what actually makes a difference to them. That is not Gareth> enough reason to treat it as the only criterion. It's not. What I'm saying is that if very few people see a noticeable difference, it should be left up to those few to implement what they need. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

On Fri, Sep 10, 2004, Stephen J. Turnbull wrote:
That's a good point, and that's what Python is moving toward. The thing is, we currently have two text processing systems, and there's no reason (given Python's dynamic dispatch capabilities) to treat one of them as second-class for this issue. It's particularly onerous in this instance because Unicode is unfortunately second-class in a number of respects, and doing what is in some respects a silent switch here would be needlessly confusing and irritating for users. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson

M.-A. Lemburg wrote:
since most real-life text use characters from only a small number of regions in that alphabet, compressed shift tables work extremely well (the algorithm on the stringlib page shows one way to do that, in constant space and O(m) time).
BTW, you might want to look at the BMS implementation I did for mxTextTools.
did you ever get around to add Unicode support to mxTextTools ? </F>

Fredrik Lundh wrote:
You mean: a compressed shift table for Unicode patterns ? I'll have a look.
Yes in egenix-mx-base 2.1.0. It's not yet released, but Google will find the most recent snapshot :-) The package has been available as beta for more than a year now; just haven't found time to cut a release. The search functions from 2.0 were replaced with search objects that can deal with both 8-strings and Unicode. However, the Unicode search implementation uses a rather naive approach due to the shift table problem (and my lack of time). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 11 2004)

M.-A. Lemburg wrote:
You mean: a compressed shift table for Unicode patterns ? I'll have a look.
It's a lossy compression: the entire delta1 table is represented as two 32-bit values, independent of the size of the source alphabet. Works amazingly well, at least when combined with the BM-variant it was designed for... (I suppose it's too late for 2.4, but it would probably be a good idea to switch to this algorithm in 2.5) </F>

Fredrik Lundh wrote:
Here's a reference that might be interesting for you: http://citeseer.ist.psu.edu/boldi02compact.html They use statistical approaches to dealing with the problem of large alphabets. Their motivation is making Java's Unicode string implementation faster... sounds familiar, eh :-) Their motivation was based on work done for the "Managing Gigabytes" project: http://www.cs.mu.oz.au/mg/ and http://www.mds.rmit.edu.au/mg/ Too bad their code is GPLed, but I suppose getting some ideas is OK ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 13 2004)

M.-A. Lemburg wrote:
thanks for the reference. but I have to admit that I found the following paper by the same authors to be more interesting ... http://citeseer.ist.psu.edu/boldi03rethinking.html ... both because they've looked into efficient designs for mutable strings, and because of how they use a 32-bit "bloom filter" hashed by the least significant bits in the Unicode characters... oh well, there are never any new ideas ;-) </F>
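The filter idea can be illustrated with a simplified search. This is a sketch only: it is neither the stringlib algorithm nor the one from the cited paper, and `bloom_find` is a hypothetical name. It records which characters occur in the pattern in a single 32-bit mask keyed on the low 5 bits of each code point, then uses the character just past the current window to decide whether a long skip is safe.

```python
def bloom_find(text, pattern):
    """Return the lowest index of pattern in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    # 32-bit "bloom filter": one bit per value of (code point & 31).
    mask = 0
    for ch in pattern:
        mask |= 1 << (ord(ch) & 31)

    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        nxt = i + m
        # If the character just past the window is definitely not in the
        # pattern, no window overlapping it can match: skip past it.
        if nxt < n and not (mask >> (ord(text[nxt]) & 31)) & 1:
            i += m + 1
        else:
            i += 1
    return -1

print(bloom_find("hello world", "world"))
```

The mask can report false positives (two characters sharing the same low 5 bits), which only costs a shorter skip. It never reports a false negative, so no match is missed.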

"Fredrik" == Fredrik Lundh <fredrik@pythonware.com> writes:
Fredrik> M.-A. Lemburg wrote: >>> (google for "stringlib" for some work I'm doing in this area) >> Ah, now I know where you're coming from :-) Shift tables don't >> work well in the Unicode world with its large alphabet. Fredrik> since most real-life text use characters from only a Fredrik> small number of regions in that alphabet, This is true of "most real-life text", but it's going to be false most of the time for a large (and rapidly growing) minority of users: those working with texts comprised mostly of Asian ideographs. Unihan (spread over about 80 256-character rows) has a potential big problem: because it is ordered by root, then stroke count, the simpler (and usually more frequently used) ideographs with a common root cluster near the root. Whether those clusters frequently overlap based on a simple compression method like "lowest 5 bits" I don't know offhand. I don't know whether the composed Hangul (~ 40 rows) would show clustering; that would depend on phonetic frequencies in the Korean language. Of course the find algorithm you present is almost surely a big win over the brute-force method, even in the presence of some degree of clustering in Unihan and Hangul. But I worry that it's an exceptional example, when you use assumptions like "real-life text uses characters drawn from a small number of short contiguous regions in the alphabet." -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

Stephen J. Turnbull wrote:
The problem is that I cannot tell if you've studied search issues, or if you're just applying general "but wait, it's different for asian languages" arguments here. There are many issues here, all pulling in different directions:
- If you look at usage statistics, you'll find that the absolute majority of all searches are for a single character (usually separators, like colons, spaces, commas). The second largest category is computer-level keywords (usually pure ASCII, also in localized programs), used to process network protocols, file formats, message headers, etc. Searches for "human text" are not that common, really, and search terms are usually limited to only a few words.
- This means that most searches have exactly the same characteristics, independent of the locale. Even if a new algorithm would only be better for pure-ASCII text, everyone would benefit.
- As for non-ASCII search terms, the "human text" search terms are usually shorter in languages with many ideographs (my non-scientific tests indicate that chinese text uses about 4 times less symbols than english; I'm sure someone can dig up better figures).
- This means that even if you are more likely to get collisions in the compressed skip table, there are fewer characters in the table.
- This means that you'll probably be able to make long skips as often as for non-Asian text.
- On the other hand, the long skips are shorter than for non-Asian text, so you may have to make more of them.
- On the other hand, the target strings are also likely to be shorter, so that might not matter.
- And so on.
The only way to know for sure is if anyone has the time and energy to carry out tests on real-life datasets. (or at least prepare some datasets; I can run the tests if someone provides me with a file with search terms and a number of files containing texts to apply them to, preferably using UTF-8 encoding). </F>

"Fredrik" == Fredrik Lundh <fredrik@pythonware.com> writes:
Fredrik> Stephen J. Turnbull wrote: >> But I worry that it's an exceptional example, when you use >> assumptions like "real-life text uses characters drawn from a >> small number of short contiguous regions in the alphabet." Fredrik> The problem is that I cannot tell if you've studied Fredrik> search issues, Enough to understand Boyer-Moore and how the proposed algorithm differs, and to recognize that your statements about the distribution of search applications are true. Not that I want to argue about search, I'm all in favor of better search. I was startled to read that Python still uses a brute-force algorithm for searching. My point about distribution of ideographs was simply that you made an unjustified assumption in the context of what is (to me, anyway) an important subdomain of text processing. Here, it is "obviously harmless," but that's because brute force search is so bad. In other applications, or with a better status quo, there very well may be real tradeoffs between what's good for 8-bit text and what's good for Unicode. Fredrik> or if you're just applying general "but wait, it's Fredrik> different for asian languages" arguments here. No, I know that ostrich won't fly. Fredrik> Searches for "human text" are not that common, really, Fredrik> and search terms are usually limited to only a few words. In the context of PEP 292 is a focus on "human text" unwarranted? After all, what motivated the PEP and the implementation was evidently "human text" processing. In my experience, the notation for interpolation it uses would have much bigger advantages over the format string style for "human text" than for the "non-human text" applications I know of. Not that it's useless for the latter, just that it's much more of a luxury there. 
If that's valid, there's a point where it makes sense for people who develop human-text-oriented features based on Unicode strings to say "pick the features you really want for 8-bit strings, because you have to support them yourselves." Fredrik> The only way to know for sure is if anyone has the time Fredrik> and energy to carry out tests on real-life datasets. (or Fredrik> at least prepare some datasets; I can prepare datasets and do some statistical work for Japanese, but it probably won't happen this month. Sounds like a worthwhile thing to have around, though. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

Stephen J. Turnbull wrote:
In the context of PEP 292 is a focus on "human text" unwarranted?
I'm pretty sure this subthread left the PEP quite a few posts ago. The rest of us were talking about string searches, of the find/replace/split variety. </F>

"Fredrik" == Fredrik Lundh <fredrik@pythonware.com> writes:
Fredrik> Stephen J. Turnbull wrote: >> In the context of PEP 292 is a focus on "human text" >> unwarranted? Fredrik> I'm pretty sure this subthread left the PEP quite a few Fredrik> posts ago. That's a funny way to spell "I don't like the way this is going, good-bye", but it works for me. <wink> Have a nice day, thanks for the information on search algorithms and usage patterns. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

"Fredrik Lundh" <fredrik@pythonware.com> wrote in message news:ci3g2d$m3g$1@sea.gmane.org...
This is why I am not especially enamored of Unicode and the prospect of Python becoming married to it. It is heavily weighted in favor of efficiently representing Chinese and inefficiently representing English. To give English equivalent treatment, the 20,000 or so most common words, roots, prefixes, and suffixes would each get its own codepoint. Terry J. Reedy

[Terry Reedy]
[Unicode] is heavily weighted in favor of efficiently representing Chinese and inefficiently representing English.
You undoubtedly forgot the smiley! :-) Many people consider that Unicode, or UTF-8 at least, is strongly favouring English (boldly American) over any other script or language. If it had not been so, Americans would never have promoted it so much, and would have rather shown an infinite and eternal reluctance... -- François Pinard http://www.iro.umontreal.ca/~pinard

Terry Reedy wrote:
Hmm, the Asian world has a very different view on these things. Representing English ASCII text in UTF-8 is very efficient (1-1), while typical Asian texts use between 1.5-2 times as much space as their equivalent in one of the resp. Asian encodings, e.g. take the Japanese translation of the bible from (only parts of New Testament): http://www.cozoh.org/denmo/
Some stats:
-----------
Number of unique code points: 1512
Code point frequency (truncated):
u'\u305f' : =================================
u' ' : =============================
u'\u306e' : ===========================
u'\uff0c' : ==========================
u'\r' : ========================
u'\n' : ========================
u'\u306b' : =====================
u'\u3044' : =================
u'\u3066' : =================
u'\u3057' : ================
u'\u3002' : ================
u'\u306f' : ================
u'\u306a' : ===============
u'\u3092' : ==============
u'\u3068' : ============
u'\u308b' : ============
u'\u3089' : ===========
u'\u3063' : ===========
u':' : ===========
u'}' : ===========
u'{' : ===========
u'\u304c' : ==========
u'\u308c' : ==========
u'\u304b' : =========
u'\u3067' : =========
u'1' : =========
u'\u5f7c' : ========
u'\u3053' : ========
u'\u3042' : =======
u'\u3061' : =======
u'\u3046' : =======
u'2' : =======
...
As you can see, most code points live in the 0x3000 area. These code points require 3 bytes in UTF-8, 2 bytes in UTF-16.
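A short sketch of how such figures could be gathered, assuming the text is already available as a Python string. The helper name `codepoint_stats` and the three-character sample are illustrative and not taken from the measurement above.

```python
from collections import Counter

def codepoint_stats(text):
    """Return (frequency table, UTF-8 byte count, UTF-16 byte count)."""
    counts = Counter(text)
    utf8_bytes = len(text.encode("utf-8"))
    # utf-16-le is used so that no byte order mark is counted.
    utf16_bytes = len(text.encode("utf-16-le"))
    return counts, utf8_bytes, utf16_bytes

japanese = "\u305f\u306e\u306b"   # three code points from the 0x3000 area
ascii_text = "abc"

_, ja_utf8, ja_utf16 = codepoint_stats(japanese)
_, en_utf8, en_utf16 = codepoint_stats(ascii_text)
print(ja_utf8, ja_utf16)
print(en_utf8, en_utf16)
```

For the three-character samples this gives 9 and 6 bytes for the Japanese text against 3 and 6 bytes for the ASCII text, which is the asymmetry described above.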
To give English equivalent treatment, the 20,000 or so most common words, roots, prefixes, and suffixes would each get its own codepoint.
I suggest you take this one up with the Unicode Consortium :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 14 2004)

On Sep 14, 2004, at 2:54 AM, Terry Reedy wrote:
Of course it is perfectly possible to have the Python unicode implementation choose to represent some unicode strings with only 8 bits per character. There is no (conceptual) reason it could not represent (u'a' * 8) with 8 bytes + class header overhead. That is simply an implementation detail and really has nothing to do with Unicode itself. It would also be possible to use UTF-8 string storage, although this has the tradeoff that indexing an element takes linear time w.r.t. position instead of constant time. James

James Y Knight:
At the cost of additional storage, indexing into UTF-8 by character rather than byte can be made better than linear. Two techniques are (1) maintain a list containing the byte index of some character index values (such as each line start) then use linear access from the closest known index and (2) to cache the most recent access due to the likelihood that the next access will be close. While I have thought about this problem, it has only once come up seriously for Scintilla (an editing component) and that was when someone was trying to provide a UCS2 facade that matched existing interfaces. Neil
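Technique (1) can be sketched as follows, assuming checkpoints are placed every `stride` characters rather than at line starts. The class name `IndexedUTF8` and the `stride` parameter are hypothetical, and the lookup always walks forward from the preceding checkpoint rather than from the closest one.

```python
class IndexedUTF8:
    """UTF-8 bytes plus a sparse table mapping character index to byte index."""

    def __init__(self, text, stride=4):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.checkpoints = []   # byte offset of characters 0, stride, 2*stride, ...
        char_index = 0
        for byte_index, byte in enumerate(self.data):
            if (byte & 0xC0) != 0x80:          # not a continuation byte
                if char_index % stride == 0:
                    self.checkpoints.append(byte_index)
                char_index += 1
        self.length = char_index

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Jump to the checkpoint at or before the character, then walk
        # forward at most stride - 1 characters.
        pos = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):
            pos += 1
            while (self.data[pos] & 0xC0) == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")

sample = "a\u00e9\u65e5b\U0001d11e"   # 1-, 2-, 3-, 1- and 4-byte characters
indexed = IndexedUTF8(sample, stride=2)
print([indexed[i] for i in range(indexed.length)] == list(sample))
```

The stride trades memory for lookup cost: each access touches a bounded number of bytes, at the price of one stored offset per `stride` characters.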

On Thu, Aug 26, 2004, Raymond Hettinger wrote:
* doesn't force result to unicode
This is the main reason I'm +0, pending further arguments. OTOH, I also like using %, so you'd have to come up with more points to move me beyond +0. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "To me vi is Zen. To use vi is to practice zen. Every command is a koan. Profound to the user, unintelligible to the uninitiated. You discover truth everytime you use it." --reddy@lion.austin.ibm.com

On Thu, 2004-08-26 at 17:38, Raymond Hettinger wrote:
Weren't you the one who gave the Cheetah example? What was interesting about that was that the instance's attributes formed the substitution namespace. That's a use case I instantly liked. So there you have state attached to an instance. Another case for that would be in i18n applications where you might want to attach information such as the gettext domain to the instance. You might also want to build up the namespace in several locations, and delay performing the substitution until the last possible moment. In all those cases you have state attached to an instance (and would immediately invent such an instance for those use cases if you didn't have one).
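A rough sketch of the "state attached to an instance" use cases, built on `string.Template` as it later shipped. The wrapper class `DeferredTemplate`, its `domain` attribute and its method names are all hypothetical and only stand in for the kind of object described here.

```python
from string import Template

class DeferredTemplate:
    """Collects a namespace in several steps and substitutes at the end."""

    def __init__(self, text, domain=None):
        self.template = Template(text)
        self.domain = domain        # e.g. a gettext domain, carried as state
        self.namespace = {}

    def update(self, **values):
        """Add values to the namespace; may be called from several places."""
        self.namespace.update(values)
        return self

    def render(self):
        """Perform the substitution at the last possible moment."""
        return self.template.substitute(self.namespace)

message = DeferredTemplate("Turn $direction at $street", domain="nav")
message.update(direction="right")
message.update(street="Main")
print(message.render())
```

A plain function would need the caller to carry the partially built namespace and the domain around separately, which is the gap an instance fills in these cases.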
To me that's not a disadvantage. For i18n applications, unicode is the only reasonable thing for human readable text. str's are only useful for byte arrays <wink>. It's not a disadvantage for Jython or IronPython either. :)
The mod operator was chosen because that's what people are familiar with, but it would probably be okay to pick a different method name. I think Guido's suggested using __call__() -- which I want to think more about.
This is doubly true if the Template instantiation is remote from the operator application.
Which, in some use cases, it most definitely will be. -Barry

[Raymond]
[Barry]
To me that's not a disadvantage.
By not inheriting from unicode, the bug can be fixed while retaining a class implementation (see sandbox\curry292.py for an example). But, be clear, it *is* a bug. If all the inputs are strings, Unicode should not magically appear. See all the other string methods as an example. Someday, all will be Unicode, until then, some apps choose to remain Unicode free. Also, there is a build option to not even compile Unicode support -- it would be a bummer to have the $ templates fail as a result. Raymond P.S. Here's the doctest from the sandbox code. What is at issue is the result of the first test:

On Mon, 2004-08-30 at 01:48, Raymond Hettinger wrote:
But the Template classes aren't string methods, so I don't think the analogy is quite right. Because the template string itself is by definition a Unicode, it actually makes more sense that everything its mod operator returns is also a Unicode. So I still don't think it's a bug.
Maybe. Like the doctor says, well, don't do that! (i.e. use Templates and disable unicode). -Barry

Templates are not Unicode by definition. That is an arbitrary implementation quirk and a design flaw. The '%(key)s' forms do not behave this way. They return str unless one of the inputs is unicode. People should be able to use Python and not have to deal with Unicode unless that is an intentional part of their design. Unless there is some compelling advantage to going beyond the PEP and changing all the rules, it is a bug. Raymond

Raymond Hettinger wrote:
I think Barry needs some backup here. First, please be aware that normal use of Templates is for formatting *text* data. Second, it is good design and good practice to store text data in Unicode objects, because that's what they were designed for, while string objects have always been an abstract container for storing bytes with varying meanings and interpretations. The latter is a design flaw that needs to get fixed, not the choice of Unicode as Template base class. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 03 2004)

[MAL]
IMO, it is subversive to start taking new string functions/methods and coercing their results to Unicode. Someday we may be there, Py3.0 perhaps, but str is not yet deprecated. Until then, a user should reasonably expect SISO (str in, str out). This is doubly true when the rest of python makes active efforts to avoid SIUO (see % formatting and ''.join() for example). Someday Guido may get wild and turn all text uses of str into unicode. Most likely, it will need a PEP so that all the issues get thought through and everything gets changed at once. Slipping this into the third alpha as if it were part of PEP292 is not a good idea. The PEP was about simplification. Tossing in unnecessary unicode coercions is not in line with that goal. Does anyone else think this is a crummy idea? Is everyone ready for unicode coercions to start sprouting everywhere? Raymond

On Fri, Sep 03, 2004, Raymond Hettinger wrote:
+0 (agreeing with Raymond) Correct me if I'm wrong, but there are a couple of issues here: * First of all, I believe that unicode strings are interoperable (down to hashing) with 8-bit strings, as long as there are no non-7-bit ASCII characters. Where things get icky is with encoded 8-bit strings making use of e.g. Latin-1. So the question is whether we need full interoperability. * Unicode strings take four bytes per character (not counting decomposed characters). Is it fair at this point in Python's evolution to force this kind of change in performance metric, essentially silently? The PEP and docs do make the issue of Unicode fairly clear up-front, so anyone choosing to use template strings knows what zie is getting into. But what about someone grabbing a module that uses template strings internally?.... OTOH, I'm not up for making a big issue out of this. If Raymond really is the only person who feels strongly about it, it probably isn't going to be a big deal in practice. In addition, I think it's the kind of change that could be easily fixed in the next release. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes

Raymond Hettinger wrote:
Yes. Whatever MAL and Barry think, Python's current model is 8+8=8, U+U=U, and 8+U=U for ascii U. That's an advantage, not a bug.
Is everyone ready for unicode coercions to start sprouting everywhere?
No. And when that time comes, storing everything as 32-bit characters is not the right answer either. </F>

Fredrik Lundh wrote:
Indeed, but I don't see how that's different from what the PEP is saying.
I'll leave that for the libc designers to decide :-) If you look at performance, there's not much difference between 8-bit strings and Unicode, so the only argument against using Unicode for storing text data is memory usage. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 04 2004)

M.-A. Lemburg wrote:
the current implementation is T(8) % 8 = U. which violates the 8+8=8 rule.
I used to make that argument, but these days, I no longer think that you can talk about performance without taking memory usage into account. </F>

Fredrik Lundh wrote:
T is a sub-class of Unicode, so you have: U % 8 = U which is just fine.
You always have to take both into account. I was just saying that 8-bit strings don't buy you much in terms of performance over Unicode these days, so the only argument against using Unicode would be doubled memory usage. Of course, this is a rather mild argument given the problems you face when trying to localize applications - which I see as the main use case for templates. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 05 2004)

On Sun, Sep 05, 2004, M.-A. Lemburg wrote:
Only if one sticks with the 2-byte Unicode implementation; you can compile Python with 4-byte Unicode (and I seem to recall that at least one standard distribution does exactly that).
If I18N is intended to be the primary/only use case of templates, then the PEP needs to be updated. It would also explain some of the disagreement about the implementation. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes

Raymond Hettinger wrote:
Hmm, I wonder why you cut away the first part: "First, please be aware that normal use of Templates is for formatting *text* data." This is the most important argument for making Template a Unicode-subclass. Coercion to Unicode then is a logical consequence and fully in line with what Python has been doing since version 1.6, i.e. U=U+U and U=U+8 (to use /F's notation).
IMO, it is subversive to start taking new string functions/methods and coercing their results to Unicode.
I don't understand... there's nothing subversive here. If strings meet Unicode the result gets coerced to Unicode. Nothing surprising here. Why are you guys putting so much effort into fighting Unicode ? I often get the impression that you are considering Unicode a nightmare rather than a blessing. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 04 2004)

On Sat, 2004-09-04 at 07:51, M.-A. Lemburg wrote:
Indeed. For example, the only way to maintain your sanity in an i18n'd application is to convert all text[1] to unicode as early as possible, deal with only unicode internally, and encode to 8bit strings as late as possible, if ever. -Barry [1] "text" defined as "strings intended for human consumption".

On Fri, 2004-09-03 at 16:37, M.-A. Lemburg wrote:
I think Barry needs some backup here.
Thanks MAL! I'll point out that Template was very deliberately subclassed from unicode, so Template instances /are/ unicode objects. From the standpoint of type conversion, using /F's notation, T(8) == U, thus because U % 8 == U, T(8) % 8 == U. Other than .encode() are there any other methods of unicode objects that return 8bit strings? I don't think so, so it seems completely natural that T % 8 returns U. Raymond is against the class-based implementation of PEP 292, but if you accept the class implementation of 292 (which I still believe is the right choice), then the fact that the mod operator always returns a unicode makes perfect sense. -Barry

That misses the point. Templates do not have to be unicode objects. Template can be their own class rather than a subclass of unicode. The application does not demand that unicode be mentioned at all. There seems to be a strong "just live with it" argument but no advantages are offered other than it matching your personal approach to text handling. Why force it when you don't have to. At least three of your users (me, Aahz, and Fred) do not want unicode output when we have str inputs.
Raymond is against the class-based implementation of PEP 292,
That choice is independent of the decision of whether to always coerce to unicode. Also, it is more accurate to say that I think the __mod__ operator is not ideal. If you want to stay with classes, Guido's __call__ syntax is also fine. It avoids the issues with %, makes it possible to have keyword arguments, and lets you take advantage of polymorphism.

The % operator has several issues:
* it is mnemonic for %(name)s substitution not $ formatting.
* it is hard to find in the docs
* it does not accept tuple/scalar arguments like % formatting
* its precedence is more appropriate for int.__mod__

Raymond

Raymond Hettinger wrote:
one of which wrote the original unicode implementation, and the mixed-type regular expression engine used to implement templates, and a very popular XML library that successfully uses mixed-type text to handle text faster and using less memory than all other Python XML libraries. I've shown over and over again that Unicode-aware text handling in Python doesn't have to be slow and bloated; I'd prefer if we kept it that way. </F>

On Sat, 2004-09-04 at 16:03, Raymond Hettinger wrote:
But it's damn convenient for them to be though. Please read the Internationalization section of the PEP. In addition to being able to use them directly as gettext catalog keys, I think there will be /a lot/ of scenarios where you won't want to care whether you have a Template or a unicode -- you will just want to treat everything as a unicode string without having to do tedious type checking.
<deep_breath> PEP 292 was a direct outgrowth of my experience in trying to internationalize an application and make it (much) easier for my translators to contribute. Many of them are not Python gurus and the existing % syntax is clearly a common tripping point. I'm convinced that the current design of PEP 292 is right for the use cases I originally designed it for. To be generous, if the three of you disagree, then it's because you have other requirements. That's fine; maybe they're just incompatible with mine. Maybe I did a poor job of explaining how my use cases lead to the design of PEP 292. If all that's true, then PEP 292 can't be made general enough and should be rejected, and the code should be ripped out of the standard library. Let applications use whatever is appropriate for their own use cases. Because PEP 292 is a library addition, Python itself won't suffer in the least. The implementations you proposed won't be of any use to me. Fortunately, the archives will be replete with all the alternatives for future software archaeologists. -Barry

Barry wrote:
from a user perspective, there's no reason to make templates a subclass of unicode, so the rest of your argument is irrelevant. instead of looking at use patterns, you're stuck defending the existing code. that's not a good way to design usable code. </F>

On Sun, 2004-09-05 at 04:26, Fredrik Lundh wrote:
from a user perspective, there's no reason to make templates a subclass of unicode, so the rest of your argument is irrelevant.
Not true. I had a very specific reason for making Templates subclasses of unicode. Read the Internationalization section of the PEP. -Barry

Barry wrote:
Not true. I had a very specific reason for making Templates subclasses of unicode. Read the Internationalization section of the PEP.
this section?

    The implementation supports internationalization magic by keeping the original string value intact. In fact, all the work of the special substitution rules are implemented by overriding the __mod__() operator. However the string value of a Template (or SafeTemplate) is the string that was passed to its constructor. This approach allows a gettext-based internationalized program to use the Template instance as a lookup into the catalog; in fact gettext doesn't care that the catalog key is a Template. Because the value of the Template is the original $-string, translators also never need to use %-strings. The right thing will happen at run-time.

I don't follow: if you're passing a template to gettext, do you really get a template back?

if that's really the case, is being able to write "_(Template(x))" really that much of an advantage over writing "Template(_(x))" ?

if that's really the case, you can still get the same effect from a template factory function, or a trivial modification of gettext. </F>
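The "Template(_(x))" ordering mentioned above can be sketched as follows. This is only an illustration, not code from the thread: it assumes the string.Template class as it later shipped in the standard library, and it uses gettext.NullTranslations as a stand-in for a real catalog (it returns every message unchanged). The message text is made up.

```python
import gettext
from string import Template

# Stand-in for a real catalog: NullTranslations returns the msgid unchanged.
_ = gettext.NullTranslations().gettext

# Translate the plain $-string first, then wrap the result in a Template.
message = Template(_("Hello $name, welcome back"))
greeting = message.substitute(name="world")
```

With this ordering the catalog key is an ordinary string, so gettext needs no knowledge of templates at all.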

Fredrik Lundh wrote:
Templates are meant to template *text* data, so Unicode is the right choice of baseclass from a design perspective.
instead of looking at use patterns, you're stuck defending the existing code. that's not a good way to design usable code.
Perhaps I'm missing something, but where would you use Templates for templating binary data (where strings or bytes would be a more appropriate design choice) ? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

Templates are meant to template *text* data, so Unicode is the right choice of baseclass from a design perspective.
Only in Python 3.0. But even so, deriving from Unicode (or str) means the template class inherits a lot of unwanted operations. While I can see that concatenating templates probably works, slicing them or converting to lowercase etc. make no sense. IMO the standard Template class should implement a "narrow" interface, i.e. *only* the template expansion method (__mod__ or something else), so it's clear that other compatible template classes shouldn't have to implement anything besides that. This avoids the issues we have with the mapping protocol: when does an object implement enough of the mapping API to be usable? That's currently ill-defined; sometimes, __getitem__ is all you need, sometimes __contains__ is required, sometimes keys, rarely setdefault. -- --Guido van Rossum (home page: http://www.python.org/~guido/) Ask me about gmail.
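A "narrow" interface of the kind described above might look roughly like this. It is a hypothetical sketch, not the sandbox or library code: the class name NarrowTemplate and its regular expression are assumptions made for illustration, and a lone "$" is simply left untouched rather than reported.

```python
import re


class NarrowTemplate:
    """Hypothetical template type whose only public operation is substitute()."""

    # Recognizes $$ (escape), $name and ${name}; names are ASCII identifiers.
    _pattern = re.compile(
        r"\$(?:(?P<escaped>\$)"
        r"|(?P<named>[_A-Za-z][_A-Za-z0-9]*)"
        r"|\{(?P<braced>[_A-Za-z][_A-Za-z0-9]*)\})"
    )

    def __init__(self, template):
        self.template = template

    def substitute(self, mapping):
        def convert(mo):
            if mo.group("escaped") is not None:
                return "$"
            # Only __getitem__ is required of the mapping.
            return str(mapping[mo.group("named") or mo.group("braced")])

        return self._pattern.sub(convert, self.template)


t = NarrowTemplate("Turn $direction, then pay $$5 at ${booth}A")
result = t.substitute({"direction": "right", "booth": 7})
```

Because the class does not derive from str or unicode, operations such as slicing or lower() are simply absent, and a compatible replacement class only has to provide substitute().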

Guido van Rossum wrote:
We better start early to ever reach the point of making a clear distinction between text and binary data in P3k.
Looks like it's not even clear what templating itself should mean... you're talking about a templating interface here, not an object type, like Barry is (for the sake of making Templates compatible to i18n tools like gettext). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

M.-A. Lemburg wrote
for the sake of making Templates compatible to i18n tools like gettext).
assuming that gettext really always returns a template if you hand it a template, of course. given that the 2.4 gettext doesn't seem to map templates to templates on my machine, that there's no sign of template support in the 2.4 gettext source code, and that Barry ignored my question about this, I have to assume that the I18N argument is yet another bogus argument. </F>

The introduction of a bytes type in Python 2.5 should be a good start.
I don't know zip about i18n or gettext. But I thought we had plenty of time since Barry has offered to withdraw the PEP 292 implementation for 2.4? -- --Guido van Rossum (home page: http://www.python.org/~guido/) Ask me about gmail.

On Wed, 2004-09-08 at 22:29, Guido van Rossum wrote:
But I thought we had plenty of time since Barry has offered to withdraw the PEP 292 implementation for 2.4?
Which I will still do if we cannot reach community agreement by beta1. But let's see how the latest proposal goes over. -Barry

On Wed, 2004-09-08 at 11:08, Guido van Rossum wrote:
Except that I think in general it'll just be very convenient for Templates to /be/ unicodes. But no matter. It seems like if we make Template a simple class, it will be possible for applications to mix in Template and unicode if they want. E.g. class UTemplate(Template, unicode). If we go that route, then I agree we probably don't want to use __mod__(), but I'm not too crazy about using __call__(). "Calling a template" just seems weird to me. Besides, extrapolating, I don't think we need separate Template and SafeTemplate classes. A single Template class can have both safe and non-safe substitution methods. So, I have working code that integrates these changes, and also uses Tim's metaclass idea to provide a nice, easy-to-document pattern overloading mechanism. I chose methods substitute() and safe_substitute() because, er, that's what they do, and those names also don't interfere with existing str or unicode methods. And to make effbot and Raymond happy, it won't auto-promote to unicode if everything's an 8bit string. I will check this in and hopefully this will put the issue to bed. There will be updated unit tests, and I will update the documentation and the PEP as appropriate -- if we've reached agreement on it. -Barry
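For readers who want to see the two methods side by side, here is a small sketch. It is written against string.Template as it exists in current Python 3 (an assumption about the reader's environment, not the code Barry was about to check in); in Python 3 every str is already Unicode, so the 8-bit versus unicode question discussed in this thread cannot be reproduced here.

```python
from string import Template

t = Template("Turn $direction at ${street}Street")

# substitute() requires every placeholder to be present.
full = t.substitute(direction="right", street="Main")

# safe_substitute() leaves unknown placeholders in the output untouched.
partial = t.safe_substitute(direction="right")

# substitute() reports the first missing name through a KeyError.
try:
    t.substitute(direction="right")
    missing = None
except KeyError as exc:
    missing = exc.args[0]
```

One class with two methods covers what previously needed separate Template and SafeTemplate classes.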

[Barry]
And to make effbot and Raymond happy, it won't auto-promote to unicode if everything's an 8bit string.
Glad to see that my happiness now ranks as a development objective ;-)
There will be updated unit tests, and I will update the documentation and the PEP as appropriate -- if we've reached agreement on it.
+1 Beautiful job.

Barry asked me to bring up one remaining implementation issue for discussion on python-dev. The docs clearly state that only python identifiers are allowed as placeholders:

    [_A-Za-z][_A-Za-z0-9]*

The challenge is that templates can be exposed to non-programmer end-users with no reason to suspect that one letter of their alphabet is different from another. So, as it stands right now, there is a usability issue with placeholder errors passing silently:

    >>> fechas = {u'hoy':u'lunes', u'mañana':u'martes'}
    >>> t = Template(u'¿Puede volver $hoy o $mañana?')
    >>> t.safe_substitute(fechas)
    u'¿Puede volver lunes o $mañana?'

The substitution failed silently (no ValueError as would have occurred with $@ or a dangling $). It may be especially baffling for the user because one placeholder succeeded and the other failed without a hint of why (he can see the key in the mapping, it just won't substitute). No clue is offered that the Template was looking for $ma, a partial token, and didn't find it (the situation is even worse if it does find $ma and substitutes an unintended value).

I suggest that the above should raise an error:

    ValueError: Invalid token $mañana on line 1, column 24

It is easily possible to detect and report such errors (see an example in nondist/sandbox/string/curry292.py).

The arguments against such reporting are:
* Raymond is smoking crack. End users will never make this mistake.
* The docs say python identifiers only. You blew it. Tough. Not a bug.
* For someone who understands exactly what they are doing, perhaps $ma is the intended placeholder -- why force them to use braces: ${ma}ñana.

In addition to the above usability issue, there is one other nit. The new invocation syntax offers us the opportunity to also accept keyword arguments as mapping alternatives:

    def substitute(self, mapping=None, **kwds):
        if mapping is None:
            mapping = kwds
        . . .

When applicable, this makes for beautiful, readable calls:

    t.substitute(who="Barry", what="mailmeister", when=now())

This would be a simple and nice enhancement to Barry's excellent implementation. I recommend that keyword arguments be adopted. Raymond
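The truncation described in this message, and one possible way of reporting it, can be sketched with the re module alone. This is an illustrative assumption, not the code in curry292.py: the helper names are hypothetical, the check ignores the $$ escape, and it reports a string index rather than a line and column.

```python
import re

# Unbraced placeholder pattern quoted from the docs in the message above.
PLACEHOLDER = re.compile(r"\$([_A-Za-z][_A-Za-z0-9]*)")


def find_placeholders(template):
    """Return the placeholder names the pattern actually sees."""
    return PLACEHOLDER.findall(template)


def check_placeholders(template):
    """Raise ValueError if an unbraced name runs into a non-ASCII alphanumeric."""
    for mo in PLACEHOLDER.finditer(template):
        following = template[mo.end():mo.end() + 1]
        # The pattern has already consumed every ASCII identifier character,
        # so an alphanumeric character here must be non-ASCII (such as "ñ").
        if following.isalnum():
            raise ValueError(
                "Invalid placeholder starting at index %d" % mo.start()
            )


template = "¿Puede volver $hoy o $mañana?"
names = find_placeholders(template)  # the second name is cut short at "ñ"
```

A braced form such as ${ma}ñana passes the check, which keeps the escape hatch for anyone who really does mean $ma.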

On Fri, 2004-09-10 at 01:50, Raymond Hettinger wrote:
Well, if I want to get other work done... :)
Cool!
It also makes it more difficult to document. IOW, right now the PEP and the documentation say that the first non-identifier character terminates the placeholder. How would you word the rules with your change?
My only problem with that is the interference that the 'mapping' argument presents. IOW, kwds can't contain 'mapping'. We could solve that in a couple of ways: 1. ignore the problem and tell people not to do that 2. change 'mapping' to something less likely to collide, such as '_mapping' or '__mapping__', and then see #1. 3. get rid of the mapping altogether and only have kwds. This would change the non-keyword invocation from mytemplate.substitute(mymapping) to mytemplate.substitute(**mymapping) A bit uglier and harder to document. Note that there's also a potential collision on 'self'. -Barry

"""Placeholders must be a valid Python identifier (containing only ASCII alphanumeric characters and an underscore). If an unbraced identifier ends with a non-ASCII alphanumeric character, such as the latin letter n with tilde in $mañana, then a ValueError is raised for the specious identifier.
My only problem with that is the interference that the 'mapping' argument presents. IOW, kwds can't contain 'mapping'.
To support a case where both a mapping and keywords are present, perhaps an auxiliary class could simplify matters:

    def substitute(self, mapping=None, **kwds):
        if mapping is None:
            mapping = kwds
        elif kwds:
            mapping = _altmap(kwds, mapping)
        . . .

    class _altmap:
        def __init__(self, primary, secondary):
            self.primary = primary
            self.secondary = secondary
        def __getitem__(self, key):
            try:
                return self.primary[key]
            except KeyError:
                return self.secondary[key]

This matches the way keywords are used with the dict(). Raymond
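The lookup order can be tried out in isolation. The sketch below restates the auxiliary class under a hypothetical name (AltMap) together with a small helper, resolve(), that exists only for this illustration; it shows that keywords win over the mapping, as they do in dict(mapping, **kwds).

```python
class AltMap:
    """Look a key up in primary first, then fall back to secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def __getitem__(self, key):
        try:
            return self.primary[key]
        except KeyError:
            return self.secondary[key]


def resolve(mapping=None, **kwds):
    """Hypothetical helper: pick the namespace a substitution would use."""
    if mapping is None:
        return kwds
    if kwds:
        return AltMap(kwds, mapping)
    return mapping


base = {"who": "Barry", "what": "mailmeister"}
lookup = resolve(base, who="Raymond")

winner = lookup["who"]     # taken from the keyword arguments
fallback = lookup["what"]  # taken from the mapping
same_as_dict = dict(base, who="Raymond")["who"] == winner
```

Unlike dict(mapping, **kwds), the wrapper copies nothing and only needs __getitem__ from the underlying mapping.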

Raymond Hettinger wrote:
I don't think any of this is needed. If a non-programmer is being told to use string substitution chances are someone is either going to explain it to them or there will be another set of docs to explain things in a simple way. I suspect stating exactly what a valid Python identifier contains as you did in parentheses above will be enough. -Brett

Brett C. wrote:
Also, since Barry has gone to great lengths to make Template overrideable, applications can replace the regular expression in their derived Template class when there is a need to allow for end-users inputting template strings. So, I'd suggest keeping safe_substitute relatively simple, but document the limitation and/or solution. Thanks, -Shane Holloway

[Brett]
I suspect stating exactly what a valid Python identifier contains as you did in parentheses above will be enough.
Given the template, u'¿Puede volver $hoy o $mañana?', you think $ma is an intended placeholder name and that ñ should be a delimiter just like whitespace and punctuation? If end users always follow the rules, this will never come up. If they don't, should there be an error message or a silent failure? Raymond

Raymond Hettinger wrote:
No, I think Brett (and apparently nearly everybody else) thinks that such a template will not be written over the course of the next five years, except for demonstration purposes. Instead, what will be written is u'¿Puede volver $today o $tomorrow?' because the template will be a translation of the original English template, and, during translation, placeholder names must not be changed (although I have difficulties imagining possible values for today or tomorrow so that this becomes meaningful).
If end users always follow the rules, this will never come up. If they don't, should there be error message or a silent failure?
There is always a chance of a silent failure in SafeTemplates, even with this rule added - this is the purpose of SafeTemplates. With a Template, you will get a KeyError. In any case, the failure will not be completely silent, as the user will see $mañana show up in the output. My prediction is that the typical application is to use Templates, as users know very well what the placeholders are. Furthermore, the typical application will use locals/globals/vars(), or dict(key="value") to create the replacement dictionary. In this application, nobody would even think of using mañana as a key, because you can't get it into the dictionary. If this never comes up, it is better to not complicate the rules. Simple is better than complex. Regards, Martin

Martin v. Löwis wrote:
Actually, that wasn't what I was thinking, but that also works. My original thinking is that Template will throw a fit and that's fine since they didn't follow the rules.
Right, my other reason for not thinking this is a big issue. If you use SafeTemplate you will have to watch out for silent problems like this anyway. I just don't think it will be a big problem. And if people want the support they will just use a pure Unicode Template subclass (perhaps we should include that in the module?). -Brett

On Sat, 2004-09-11 at 04:39, "Martin v. Löwis" wrote:
I tend to agree, so I'd like to keep the rules as they currently stand. Your prediction is aligned with what I think the most common use cases are too. -Barry

Raymond Hettinger wrote:
so why keep the python identifier limitation? the RE engine you're using to parse the template has a concept of "alphanumeric character". just define the placeholder syntax as "one or more alphanumeric characters or underscores" (\w+), use re.UNICODE if the template is created from a unicode string, and you're done. this doesn't mean that people *have* to use non-ASCII characters, of course. but if they do, things just work. </F>
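A rough sketch of this suggestion, assuming the string.Template class of current Python 3 and its documented idpattern class attribute (the subclass name is hypothetical). In Python 3, \w already matches Unicode word characters in str patterns, so no explicit re.UNICODE flag is needed here.

```python
from string import Template


class WordTemplate(Template):
    # Placeholder names are "one or more alphanumeric characters or
    # underscores", as suggested above, instead of ASCII identifiers only.
    idpattern = r"\w+"


fechas = {"hoy": "lunes", "mañana": "martes"}
result = WordTemplate("¿Puede volver $hoy o $mañana?").substitute(fechas)
```

The same subclassing hook lets an application that exposes templates to end users widen the placeholder syntax without changing the default rules.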

On Fri, 2004-09-10 at 18:22, Raymond Hettinger wrote:
This matches the way keywords are used with the dict().
This isn't exactly what I was concerned about, but I agree that it's a worthwhile approach. (I'm going to accept your patch and check it in, with slight modifications.) What I was worried about was that if you provide 'mapping' positionally, and kwds contains a 'mapping' key, you'll get a TypeError. I'm going to change the positional argument to '__mapping' so collisions of that kind are less likely, and will document it in libstring.tex. -Barry
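The collision is easy to reproduce with plain functions. The two functions below are hypothetical stand-ins for the method under discussion; the second uses the name _mapping (one of the renaming options floated earlier in the thread) rather than the exact name that was checked in.

```python
def substitute_v1(mapping=None, **kwds):
    """Positional parameter named 'mapping': clashes with a 'mapping' keyword."""
    return dict(mapping or {}, **kwds)


def substitute_v2(_mapping=None, **kwds):
    """Same logic with a less common parameter name, so 'mapping' is free."""
    return dict(_mapping or {}, **kwds)


extra = {"mapping": "a street map"}

try:
    substitute_v1({"who": "Barry"}, **extra)
    collided = False
except TypeError:
    # Python sees two values for the parameter 'mapping'.
    collided = True

merged = substitute_v2({"who": "Barry"}, **extra)
```

Renaming only makes the clash less likely; a keyword that happens to be called _mapping would still collide.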

M.-A. Lemburg wrote:
not true. as I've shown in SRE and ElementTree (just to give a few examples), 8-bit strings are superior for the *huge* subset of all text strings that only contain ASCII data.
8-bit strings != binary data. you clearly haven't read my other posts in this thread. please do that, instead of repeating the same bogus arguments over again. </F>

Fredrik Lundh wrote:
I've read them all and, to be honest, I don't follow your argumentation. The text interpretation of 8-bit strings is only one possible form of their interpretation. You could just as well have image data in your 8-bit string and calling .lower() on such a string is certainly going to render that image data useless. The whole point in adding Unicode to the language was to make the difference between text and binary data clear and visible at the type level. I'm not saying that you can not store text data in 8-bit strings, but that we should start to make use of the distinction between text and binary data. If we start to store text data in Unicode now and leave binary data in 8-bit strings, then the move to Unicode strings literals will be much smoother in P3k. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

M.-A. Lemburg wrote:
well, when I wrote the Unicode type, the whole point was to be able to make it easy to handle Unicode text. no more, no less.
hopefully, the P3K string design will take a lot more into account than text-vs-binary; there are many ways to represent text, and many ways to store binary data, and many usage patterns for them both. a good design should take most of this into account. (google for "stringlib" for some work I'm doing in this area) </F>

Fredrik Lundh wrote:
... and the Unicode integration made that a reality :-) In today's globalized world, the only sane way to deal with different scripts is through Unicode, which is why I believe that text data should eventually always be stored in Unicode objects - regardless of whether it takes more memory or not. (If you compare development time to prices of a few GB extra RAM, the effort needed to maintain text in non-Unicode formats simply doesn't pay off anymore.)
Ah, now I know where you're coming from :-) Shift tables don't work well in the Unicode world with its large alphabet. BTW, you might want to look at the BMS implementation I did for mxTextTools. Here's a nice reference for pattern matching: http://www-igm.univ-mlv.fr/~lecroq/string/index.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 08 2004)

Marc-Andre Lemburg wrote:
This is not as obvious as it seems, because the "few GB extra RAM" is a price paid by everyone who *uses* the software. Granted, it's quite common for software to be only run ever on one or two machines in the company where it was developed, but not all software is used that way. Also: the price of "a few GB extra RAM" is not always as low as it seems. If adding 2GB means moving from 3GB to 5GB, it may mean replacing the CPU and the OS. That said, I strongly agree that all textual data should be Unicode as far as the developer is concerned; but, at least in the USA :-), it makes sense to have an optimized representation that saves space for ASCII-only text, just as we have an optimized representation for small integers. (The benefit is potentially much greater in that case, though.) -- g

"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> That said, I strongly agree that all textual data should be Unicode as far as the developer is concerned; but, at least in the USA :-), it makes sense to have an optimized representation that saves space for ASCII-only text, just as we have an optimized representation for small integers.

This is _not at all_ obvious. As MAL just pointed out, if efficiency is a goal, text algorithms often need to be different for operations on texts that are dense in an 8-bit character space, vs texts that are sparse in a 16-bit or 20-bit character space. Note that that is what </F> is talking about too; he points to SRE and ElementTree.

When viewed from that point of view, the subtext to </F>'s comment is "I don't want to separately maintain 8-bit versions of new text facilities to support my non-Unicode applications, I want to impose that burden on the authors of text-handling PEPs." That may very well be the best thing for Python; as </F> has done a lot of Unicode implementation for Python, he's in a good position to make such judgements. But the development costs MAL refers to are bigger than you are estimating, and will continue as long as that policy does.

While I'm very sympathetic to </F>'s view that there's more than one way to skin a cat, and a good cat-handling design should account for that, and conceding his expertise, none-the-less I don't think that Python really wants to _maintain_ more than one text-processing system by default. Of course if you restrict yourself to the class of ASCII-only strings, you can do better, and of course that is a huge class of strings. But that, as such, is important only to efficiency fanatics.

The question is, how often are people going to notice that when they have pure ASCII they get a 100% speedup, or that they actually can just suck that 3GB ASCII file into their 4GB memory, rather than buffering it as 3 (or 6) 2GB Unicode strings?
Compare how often people are going to notice that a new facility "just works" for Japanese or Hindi. I just don't see the former being worth the extra effort, while the latter makes the "this or that" choice clear. If a single representation is enough, it had better be Unicode-based, and the others can be supported in libraries (which turn binary blobs into non-standard text objects with appropriate methods) as the need arises. -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
I hope you aren't expecting me to disagree.
How do you know what I am estimating?
No, it's important to ... well, people to whom efficiency matters. There's no need for them to be fanatics.
Why is that the question, rather than "how often are people going to benefit from getting a 100% speedup when they have pure ASCII"? Or even "how often are people going to try out Python on an application that uses pure-ASCII strings, and decide to use some other language that seems to do the job much faster"?
No question that if a single representation is enough then it had better be Unicode.

--
g

"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
    Gareth> On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:

    >> But [efficiency], as such, is important only to efficiency
    >> fanatics.

    Gareth> No, it's important to ... well, people to whom efficiency
    Gareth> matters. There's no need for them to be fanatics.

If it matters just because they care, they're fanatics. If it matters because they get some other benefit (response time less than the threshold of notice, twice as many searches per unit time, half as many boxes to serve a given load), they're not.

</F>'s talk of many ways to do things "and Python should account for most of them" strikes me as fanaticism by that definition; the vast majority of developers will never deal with the special cases, or write apps that anticipate dealing with huge ASCII strings. Those costs should be borne by the developers who do, and their clients. I apologize for shoehorning that into my reply to you.

    >> The question is, how often are people going to notice that when
    >> they have pure ASCII they get a 100% speedup [...]?

    Gareth> Why is that the question, rather than "how often are
    Gareth> people going to benefit from getting a 100% speedup when
    Gareth> they have pure ASCII"?

Because "benefit" is very subjective for _one_ person, and I don't want to even think about putting coefficients on your benefit versus mine. If the benefit is large enough, a single person will be willing to do the extra work. The question is, should all Python users and developers bear some burden to make it easier for that person to do what he needs to do?

I think "notice" is something you can get consensus on. If a lot of people are _noticing_ the difference, I think that's a reasonable rule of thumb for when we might want to put "it", or facilities for making individual efforts to deal with "it" simpler, into "standard Python" at some level. If only a few people are noticing, let them become expert at dealing with it.

    Gareth> Or even "how often are people going to try out Python on
    Gareth> an application that uses pure-ASCII strings, and decide to
    Gareth> use some other language that seems to do the job much
    Gareth> faster"?

See? You're now using a "notice" standard, too. I don't think that's an accident.

    >> I just don't see the former being worth the extra effort, while
    >> the latter makes the "this or that" choice clear. If a single
    >> representation is enough, it had better be Unicode-based, and
    >> the others can be supported in libraries (which turn binary
    >> blobs into non-standard text objects with appropriate methods)
    >> as the need arises.

    Gareth> No question that if a single representation is enough then
    Gareth> it had better be Unicode.

Not for you, not for me, not for </F>, I'm pretty sure. The point here is that there is a reasonable way to support the others, too, but their users will have to make more effort than if it were a goal to support them in the "standard language and libraries." I think that's the way to go, and </F> thinks the opposite AFAICT.

--
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba

participants (16)
- "Martin v. Löwis"
- Aahz
- Barry Warsaw
- Brett C.
- François Pinard
- Fredrik Lundh
- Gareth McCaughan
- Guido van Rossum
- James Y Knight
- M.-A. Lemburg
- Neil Hodgson
- Raymond Hettinger
- Raymond Hettinger
- Shane Holloway (IEEE)
- Stephen J. Turnbull
- Terry Reedy