Internationalization Toolkit

Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to Java's byte array - in other words, an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already. To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back.

Easy Conversions
----------------
This is modelled on Java, which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered whether conversion happens in a global function, a constructor method or whatever.

    MyUniString = ToUnicode('hello')                                 # assumes ASCII
    MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS')  # specified

The converse applies when converting back. The encoding designators should agree with Java's. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly:

1. replace offending characters with a question mark
2. try to recover intelligently (possible in some cases)
3. raise an exception

A 'Unicode' designator is needed which performs a dummy conversion.

File Opening:
---------------
It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally encoded addresses into a Unicode string, e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file.
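The conversion API and error strategies described under "Easy Conversions" could be sketched like this in modern Python terms (the names ToUnicode/FromUnicode come from the proposal above; the strategy argument and its mapping onto Python's codec error handlers are my assumptions, and strategy 2, intelligent recovery, is encoding-specific and omitted):

```python
# Hedged sketch of the proposed conversion API, not an existing one.
# Strategies map onto Python's codec error handlers.

def ToUnicode(data, encoding='ascii', strategy='raise'):
    """Convert a byte string to a Unicode string.

    strategy: 'raise'   -- raise an exception on invalid data
              'replace' -- substitute a replacement character
    """
    errors = {'raise': 'strict', 'replace': 'replace'}[strategy]
    return data.decode(encoding, errors)

def FromUnicode(text, encoding='ascii', strategy='raise'):
    """Convert a Unicode string back to bytes in the given encoding."""
    errors = {'raise': 'strict', 'replace': 'replace'}[strategy]
    return text.encode(encoding, errors)
```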
Not sure of a good API to do this. We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open'.

Doing the Conversions
----------------------------
All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, and many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good, solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb, but there are compact forms which can reduce these to a few kb and also make the data intelligible. It is obviously desirable to store stuff compactly if we can unpack it fast.

Typed Strings
----------------
When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like...
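A minimal sketch of the typed-string idea, with entirely hypothetical names (TypedString, EncodingError) and modern Python's bytes/str standing in for the proposed types:

```python
# Hypothetical sketch: a byte string that knows its encoding,
# refusing silent mixing of differently encoded data (cf. Fowler's
# Quantity pattern).  None of these names are a real API.

class EncodingError(ValueError):
    pass

class TypedString:
    def __init__(self, data, encoding):
        data.decode(encoding)          # validate on construction
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        if self.encoding != other.encoding:
            raise EncodingError('cannot mix %s and %s strings'
                                % (self.encoding, other.encoding))
        return TypedString(self.data + other.data, self.encoding)

    def convert(self, encoding):
        # explicit 'typecast': go through Unicode as the central point
        text = self.data.decode(self.encoding)
        return TypedString(text.encode(encoding), encoding)
```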
An EncodingError would be raised unless the string is compatible.

Going Deeper
----------------
The project I describe involved many more issues than just a straight conversion. I envisage an encodings package or module which power users could get at directly. We have to be able to answer the questions:

- 'Is string X a valid instance of encoding Y?'
- 'Is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can.
- 'Can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed?' This is a really useful utility.

More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another, and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data-driven. So we could have an Encoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding. So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and cp932ms'.

GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor windows are fine but console output is shown as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility.
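The "Going Deeper" questions lend themselves to small utilities; here is a sketch under the assumption that codecs for Y and Z exist (the function names are hypothetical, and real corruption analysis would need far more than this):

```python
# Hypothetical sketch of the reasoning utilities described above.

def is_valid(data, encoding):
    """Is byte string `data` a valid instance of `encoding`?"""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

def lossless(data, src, dst):
    """Can `data` go from encoding `src` to `dst` without loss?
    Returns (ok, list_of_characters_that_would_be_trashed)."""
    text = data.decode(src)
    lost = [ch for ch in text
            if ch.encode(dst, 'replace').decode(dst) != ch]
    return (not lost, lost)
```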
Verifiability and printing
-----------------------------
We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste. If it could save a page as HTML (just an encoding tag and data between <PRE> tags), then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings.

All the encoding stuff is ultimately a bit meaningless without a way to display a character. I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding-independent, and come with a whole bunch of mapping tables. So I can ask for the same Japanese font in any of about ten encodings - the font name is a combination of face name and encoding, and the font itself does the remapping. They make downloadable font packs for Acrobat 4.0 available for most languages now; these are good places to raid for building encoding databases. It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output for regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing and thus help demonstrate its own correctness (or make errors stick out like a sore thumb).

Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project.
But I believe we could define the standard for years to come. Furthermore, it would go a long way to making Python the corporate choice for data cleaning and transformation - territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.

Andy,

Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP, who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
The proposal seems reasonable to me.
(Together with a new Unicode regex engine by /F.)
This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"

[AMK]
The proposal seems reasonable to me.
Thanks. I really hope that this time we can move forward united...
It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet.
Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood.
Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes.
Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng

"AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes:
AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry

Barry A. Warsaw writes: (in relation to support for Unicode regexes)
I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_

Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is).
In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string containing 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE.
Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"

[Andrew M. Kuchling]
This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim

(a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious: ordinary pattern, ordinary string unicode pattern, unicode string but what about these? ordinary pattern, unicode string unicode pattern, ordinary string "coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F>

[Guido, on "a new Unicode regex engine by /F"]
No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim

Mark Hammond wrote:
Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though.
"U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
... Well almost... it depends on the current value of <default encoding>.
Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
Exactly the reason to avoid wchar_t.
History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now?
Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable-length format, when people seem to be doing fine with UCS-2?

Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own.

One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people could work with the values entirely in Python, PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there.
Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote: [MAL:]
Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whatever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG; was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude

Jean-Claude Wippler wrote:
UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/

[ Greg Stein]
The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>.
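Both halves of Tim's searching argument are easy to demonstrate (a sketch in modern Python; the sample characters are arbitrary choices of mine):

```python
# UTF-8 lead bytes never reappear as continuation bytes, so a plain
# byte-level search cannot start matching in the middle of a character.

utf8 = 'h\xe9llo'.encode('utf-8')   # 'h' + two bytes for é + 'llo'
assert utf8.find(b'l') == 3         # finds the real 'l', no false hit

# UCS-2 counterexample: U+4E41 is a CJK ideograph whose low byte is
# the ASCII code for 'A', so a byte-level search scores a bogus hit
# "in the middle" of a character.
ucs2 = '\u4e41'.encode('utf-16-be')  # b'\x4eA'
assert b'A' in ucs2
```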
No argument here.

Greg Stein <gstein@lyra.org> wrote:
Have you ever noticed how Python modules, packages, tools, etc, never define an import hook?
hey, didn't MAL use one in one of his mx kits? ;-)
I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
exactly. modes are evil. python is not perl. etc.
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F>

Fredrik Lundh wrote:
Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter.
But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals).
No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Marc writes:
Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants.

For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized.

I agree that having a default encoding that can be changed is a bad idea. It may make 3-line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread-specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do?

Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable.

Mark.
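The helper-function idea above (per-thread encoding kept out of the interpreter core, in thread-local state at the Python level) might look like this sketch; the names are hypothetical, not a real API:

```python
# Hypothetical sketch: a user-level per-thread encoding, so the
# core's default encoding can stay fixed.
import threading

_state = threading.local()

def set_thread_encoding(encoding):
    """Set the encoding used by this thread only."""
    _state.encoding = encoding

def thread_encode(text):
    """Encode text using this thread's encoding (default: UTF-8)."""
    return text.encode(getattr(_state, 'encoding', 'utf-8'))
```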
______________________________________________________________________

Tim Peters wrote:
See my other post on the subject...

Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings represented in Unicode.

I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write

    file.write(u.encode('utf-8'))

instead of

    file.write(u) # let's hope this goes out as UTF-8...

Using the <default encoding> as a site-dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable.

-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm European, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.
By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1
Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either or good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core.

Tim Peters wrote:
Agreed.
Right.
Ditto :-) I have nothing against telling people to take care of the problem in user space (meaning: not done by the core interpreter), and I'm pretty sure that HP will agree on this too, provided we give them the proper user-space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Mark Hammond wrote:
I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it, and if you do, you are on your own when something breaks. Think of it as an HP-specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Really - I see it as moving to a rational consensus that doesn't support the proposal in this regard. I see no heat in it at all. I'm sorry if you saw my post or any of the followups as "emotional", but I'm certainly not getting passionate about this. I don't see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because we are trying to get it right doesn't mean we are getting heated.
Hrm - I'm having serious trouble following your logic here. If I make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I won't necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP:
That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely disappointed if the fact that HP are throwing money at this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this.

Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference.

Maybe I'm missing something. Can you explain why it is good enough for everyone else to be required to assume there is no default encoding, but HP get their thread-specific global? Are their requirements greater than anyone else's? Is everyone else not as important? What would you, as a consultant, recommend to people who aren't HP, but have a similar requirement? It would seem obvious to me that HP's requirement can be met in "pure Python", thereby keeping this out of the core altogether...

Mark.

[per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon

Mark Hammond wrote:
Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one.
Sure there are some very subtle dangers in setting the default to anything other than the default ;-) For some this risk may be worth taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them.
Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
hehe... funny you mention this. Go read the Consortium docs. Last time I read them, there were no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Sure, but have you considered the option of a member simply bailing out ? HP could always stop funding Unicode integration. That wouldn't help us either...
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen.
Cheers, -g -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/

On Thu, 11 Nov 1999, Mark Hammond wrote:
Ha! I was getting ready to say exactly the same thing. Are we building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-)
Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-)
*bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/
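[Editorial sketch] Greg's "keep it out of the interpreter" suggestion above can be sketched in a few lines of plain Python: a module-global dictionary maps thread ids to default encodings, and a utility function consults it. All names here (set_default_encoding, to_unicode) are hypothetical, not part of any proposed API; modern str/bytes spellings are used for illustration.

```python
# A sketch of Greg's suggestion: per-thread default encodings live in a
# plain module-global dictionary, not in the interpreter.
import threading

_defaults = {}          # maps thread id -> default encoding name
_FALLBACK = "ascii"     # the interpreter-wide default stays fixed

def set_default_encoding(name):
    """Set the default encoding for the calling thread only."""
    _defaults[threading.get_ident()] = name

def to_unicode(raw):
    """Decode a byte string using the calling thread's default."""
    encoding = _defaults.get(threading.get_ident(), _FALLBACK)
    return raw.decode(encoding)
```

A thread that calls set_default_encoding("shift_jis") changes only its own lookups; every other thread keeps the fixed fallback, which is exactly the containment the interpreter-global version cannot offer.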

On Thu, 11 Nov 1999, Greg Stein wrote:
What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david

Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote:
-- Greg Stein, http://www.lyra.org/

[/F]
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed?
[MAL]
Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim
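[Editorial sketch] Tim's "1M new code points" figure follows directly from the surrogate mechanism: a high surrogate and a low surrogate each carry 10 bits, so pairs address 1024 x 1024 supplementary characters. A quick check of the arithmetic:

```python
# Surrogate-pair arithmetic behind the "1M new code points" figure.
high_surrogates = 0xDBFF - 0xD800 + 1   # 1024 high (lead) surrogates
low_surrogates  = 0xDFFF - 0xDC00 + 1   # 1024 low (trail) surrogates
supplementary   = high_surrogates * low_surrogates
print(supplementary)                     # -> 1048576 code points beyond the BMP
```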

Tim Peters wrote:
If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M characters at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
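[Editorial sketch] MAL's proposal amounts to a construction-time assertion: accept the data as UCS-2 and raise as soon as a character falls outside the 16-bit range. A minimal illustration in modern Python (ucs2_check is a hypothetical name, not part of the proposal):

```python
# Treat input as UCS-2: raise at construction time for any character
# outside the 16-bit range, as MAL proposes.
def ucs2_check(text):
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "character U+%06X is outside the UCS-2 range" % ord(ch))
    return text
```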

[MAL]
I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly sped up by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better.
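[Editorial sketch] Tim's finger idea leans on one property of UTF-8: continuation bytes all match the bit pattern 10xxxxxx, so the first byte of each character is recognizable when scanning in either direction. A minimal sketch (Utf8Finger is a hypothetical name, not any proposed API):

```python
# Cache the last (char_index, byte_offset) pair so mostly-sequential
# indexing into a UTF-8 buffer is constant-time, scanning forward or
# backward from the finger as needed.
def is_lead(byte):
    return (byte & 0xC0) != 0x80   # anything but a continuation byte

class Utf8Finger:
    def __init__(self, data):
        self.data = data           # a UTF-8 encoded bytes object
        self.char, self.offset = 0, 0

    def byte_offset(self, index):
        """Byte offset of character `index`, scanning from the finger."""
        char, off = self.char, self.offset
        step = 1 if index >= char else -1
        while char != index:
            off += step
            while 0 < off < len(self.data) and not is_lead(self.data[off]):
                off += step        # skip continuation bytes
            char += step
        self.char, self.offset = char, off
        return off
```

For a buffer like "aüb€c" encoded as UTF-8, successive calls byte_offset(0), (1), (2), ... each only advance a character or two from the cached position, which is the common sequential-indexing case Tim describes.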

Tim Peters wrote:
Here are some arguments for using the proposed UTF-16 strategy instead:

· all characters have the same length; indexing is fast
· conversion APIs to platform-dependent wchar_t implementations are fast because they can either simply copy the content or extend the 2 bytes to 4
· UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages
· from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
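[Editorial sketch] The size trade-off in MAL's bullets is easy to check directly: "u with two dots" costs 2 bytes in UTF-8 (same as UTF-16), while plain ASCII doubles in size under UTF-16.

```python
# Byte sizes behind the UTF-8 vs UTF-16 arguments above.
u_umlaut = "\u00fc"                        # LATIN SMALL LETTER U WITH DIAERESIS
print(len(u_umlaut.encode("utf-8")))       # 2 bytes -- the "compound" case
print(len("hello".encode("utf-8")))        # 5 bytes -- identical to ASCII
print(len("hello".encode("utf-16-be")))    # 10 bytes -- twice the ASCII size
```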

<rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...)
RE engines etc. can then directly work with this buffer.
sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu).
(hey, I'm not a microsofter. but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F>

Fredrik Lundh wrote:
Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh wrote:
If we don't add that support, lots of existing APIs won't accept Unicode objects instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
[I've waited quite some time for you to chime in on this one ;-)] Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer: First, we have a general design question here: should old code become Unicode compatible or not. As I recall, the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come up with a reasonable approach for integrating classical string based APIs with the new type. Since UTF-8 is a standard (some would probably prefer UTF-7,5, e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose. However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is, up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using an 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text. So if we are to do the integration, we'll have to choose argument parser markers that allow for multi-byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is arguable. Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would lose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes. Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is.
Defining "s#" to return UTF-8 data not only makes "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16). If you are still with me, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). The other options of adding new magic parser markers or switching to more generic ones all have one downside: you need to change existing code, which is in conflict with the idea we started out with. So, again, the question is: do we want this magical integration or not ? Note that this is a design question, not one of memory consumption... -- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though:

    PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach:

    PyArg_ParseTuple(args,"O",&obj);
    unicode = PyUnicode_FromObject(obj);
    ...
    Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new reference if obj is a Unicode object or create a new Unicode object by interpreting str(obj) as a UTF-8 encoded string.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL]
Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:
I think you left out t#.
I've never heard of this idea before -- or am I taking it too literally? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide.
Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week; hopefully when I see the details, it'll be clear that they don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.)
Sure, but where in current Python are there such requirements?
I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#. Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes. Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation. Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best): - utf-8 - latin-1 - ascii - shift-jis - lower byte of unicode ordinal - some user- or os-specified multibyte encoding As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8. 
If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.)
This is a red herring given my explanation above.
This is for C code. Quite likely it *does* know what the internal data format is!
That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely.
(and t#, which is more relevant here)
Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding.
Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( )
No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue.
This might work. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined.
Fair enough :-)
It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at...
I know it's too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementation's point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something...
This definition should then be changed to "text string without null bytes" dropping the 8-bit reference.
Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal).
I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons.
The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.)
Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below).
C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length.
Done :-)
Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view.
t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing.
The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.)
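[Editorial sketch] Guido's encryption-engine example can be made concrete with a toy byte-stream engine: it neither knows nor cares whether its input bytes are a GIF, an MP3, or some encoded text, which is exactly why it must receive the raw bytes (s#), not a translated encoding (t#). The XOR "cipher" and the name xor_engine are purely illustrative.

```python
# A toy byte-stream "encryption engine": operates on raw bytes only,
# oblivious to what the bytes represent.
def xor_engine(data, key=0x5A):
    return bytes(b ^ key for b in data)

payload = "\u65e5\u672c".encode("utf-8")     # could just as well be GIF data
assert xor_engine(xor_engine(payload)) == payload   # round-trips exactly
```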
Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? (As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode.
If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes and check for a BOM and then decide to choose bewteen 'utf-16-be' and 'utf-16-le'.)
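[Editorial sketch] The BOM check Guido suggests is a two-byte peek followed by a choice between the two byte orders. A minimal sketch in modern Python (detect_utf16 is a hypothetical helper, not a proposed API):

```python
# Read the first 2 bytes, check for a BOM, then choose between
# 'utf-16-be' and 'utf-16-le' as Guido suggests.
def detect_utf16(data):
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")   # no BOM: assume big-endian
```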
Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means).
Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings.
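[Editorial sketch] The strict/non-strict split Guido describes maps directly onto an error-handling argument to the conversion, shown here with the modern spelling (the 1999 proposal's API may differ):

```python
# Strict vs. replace behavior for untranslatable characters.
text = "caf\u00e9"                        # 'é' is not encodable as ASCII
try:
    text.encode("ascii")                  # the strict version raises
except UnicodeEncodeError:
    pass
print(text.encode("ascii", "replace"))    # the non-strict version: b'caf?'
```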
Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits.
This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this.
Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway).
--Guido van Rossum (home page: http://www.python.org/~guido/)
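[Editorial sketch] Guido's byte-count vs. character-count point above is easy to see directly: allocation needs the byte length, and multibyte-aware code can recover the character count from the data whenever it wants it.

```python
# Byte count (what s#/t# would report) vs. character count.
s = "\u65e5\u672c\u8a9e"              # three characters ("Japanese")
utf8 = s.encode("utf-8")
print(len(utf8))                      # 9 bytes -- needed for allocation
print(len(utf8.decode("utf-8")))      # 3 characters -- recoverable on demand
```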

Guido van Rossum wrote:
Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if being passed Unicode objects: this probably won't get us the "magical" Unicode integration we envision, since "t#" usage is not very widespread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods...
Right, that's the idea (there is a note on this in the Standard Codec section of the proposal).
Ok.
Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal.
What for ? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation because both encodings are in fact "internal" w/r to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed.
Well... yes ;-)
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8-bits long. i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters.
Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, followed by the raw data. Done. It's even portable.
Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format).
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ]
I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation.
How about: "because I'm the application developer, and I say that I want the raw bytes in the file."
The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is: open('file', 'w').write(u) If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is: open('file', 'w').write(encode(u, 'utf-8')) This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-) Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic". Cheers, -g -- Greg Stein, http://www.lyra.org/
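[Editorial sketch] Greg's two cases above, with the byte-vs-text split spelled out in modern str/bytes terms (encode() here is the string method, not the free function Greg sketches, and the temp-file path is illustrative):

```python
# "I'm the application developer, and I say I want the raw bytes": write
# an explicitly chosen encoding, read the raw bytes back, no conversion
# getting in the way.
import os
import tempfile

u = "h\u00e9llo"
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "wb") as f:
    f.write(u.encode("utf-8"))        # explicit: I *want* UTF-8 on disk
with open(path, "rb") as f:
    round_tripped = f.read().decode("utf-8")
print(round_tripped == u)             # True
```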

Greg Stein wrote:
FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Good.
I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-)
You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just do LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/)

[MAL]
FYI, the next version of the proposal ... File objects opened in text mode will use "t#" and binary ones use "s#".
Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>.

On Wed, 17 Nov 1999, Tim Peters wrote:
It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/

[MAL]
File objects opened in text mode will use "t#" and binary ones use "s#".
[Greg Stein]
Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode.
'r' is already interpreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be.

On Wed, 17 Nov 1999, Tim Peters wrote:
Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/

Tim Peters wrote:
Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance.
I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. Only other buffer-compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself arguing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker.
It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down.
Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway.
I'm not writing portable scripts. I mentioned that once before. I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that respond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introducing text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg> I'm not writing portable scripts. I mentioned that once before. I
Greg> don't want a difference between 'r' and 'rb' on my Linux box. It
Greg> was never there before, I'm lazy, and I don't want to see it added
Greg> :-).
...
Greg> I just don't like introducing text/binary to *nix platforms where
Greg> it didn't exist before.

I'll vote with Greg, Guido's cross-platform conversion notwithstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other, harder to identify and correct, portability issues than binary file access. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability.

porta-nanny.py-anyone?-ly y'rs,

Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Greg Stein wrote:
Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-)
Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#", so you won't see any change in function or find subtle errors being introduced. If you are already using the buffer feature, e.g. for arrays, which implement "s#" but don't support "t#" for obvious reasons, you'll run into trouble -- but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't been published yet, there are probably very few users out there who would actually run into any problems. And even if they do, it's a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. E.g. the logic for processing "t#" could check if the text buffer is supported, and if not, fall back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users don't bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from "portable among all platforms" to "portable among all platforms and objects". Mark.

Mark Hammond wrote:
It's an error that's pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only affect platforms that don't differentiate between text and binary mode, e.g. Unix ones.
I doubt that this conforms to what the buffer interface wants to reflect: if the getcharbuf slot is not implemented, this means "I am not text". If you write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'.
Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this.
Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

>> FYI, the next version of the proposal ... File objects opened in
>> text mode will use "t#" and binary ones use "s#".

Tim> Am I the only one who sees magical distinctions between text and
Tim> binary mode as a Really Bad Idea?

No.

Tim> I wouldn't have guessed the Unix natives here would quietly
Tim> acquiesce to importing a bit of Windows madness <wink>.

We figured you and Guido would come to our rescue... ;-)

Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/)

Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite some places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length is that they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters.
Here I'm with you, man!
Greg Stein, http://www.lyra.org/
--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Stein writes:
[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

M.-A. Lemburg writes:
Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there, it's an "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Fred L. Drake, Jr. <fdrake@acm.org> wrote:
    from unicode import *

    def getname():
        # hidden in some database engine, or so...
        return unicode("Linköping", "iso-8859-1")

    ...

    name = getname()

    # emulate automatic conversion to utf-8
    name = str(name)

    # print it in uppercase, in the usual way
    import string
    print string.upper(name)

    ## LINKöPING

I don't know, but I think that I think that it perhaps should raise an exception instead... </F>

"Fred L. Drake, Jr." wrote:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode-aware.
If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization!
This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s").
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing script Unicode aware.
Ok, so I haven't read closely enough.
Right.
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal.
I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider: s = fp.read() u = unicode(s, 'utf-8') u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Tim Peters writes:
Yet another use for a weak reference <0.5 wink>.
Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-)
Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred]
This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. Buy him lunch tomorrow and get him excited <wink>.

Tim Peters writes:
Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr.]
Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate).
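For readers following this subthread: the facility being argued for here is what eventually shipped as the standard `weakref` module. A minimal illustration of the concept under discussion (modern Python, shown only to make it concrete; not part of the 1999 proposals):

```python
import weakref

class Node:
    """Toy object; weak references need a type that supports them."""
    pass

n = Node()
r = weakref.ref(n)      # does not add a "strong" reference to n
assert r() is n         # referent alive: calling the ref yields it
del n                   # drop the last strong reference
assert r() is None      # the weak ref did not keep the object alive
```

This matches Tim's point that the cost model matters: CPython's eventual design pays per-type (a slot in the type object and an extra pointer only in weakly-referenceable instances), not one pointer in every object.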

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
Bull! You can easily provide "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-)
I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, 13 Nov 1999, Guido van Rossum wrote:
Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, Tim Peters wrote:
No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...
Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time.
The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasable and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ): """ Q: How about using UCS-4 interfaces in my APIs? Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments. Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels. Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc. Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string. 
That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """
All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
See above.
"""
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligible, e.g. we could define the internal format to be UTF-16 and raise an exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-).

But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I can't make time for a close review now. Just one thing that hit my eye early:

Python should provide a built-in constructor for Unicode strings which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])

    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence:

    octet:          hexdigit hexdigit
    unicodecode:    octet octet
    unicode_escape: "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open.

BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.

international-in-spite-of-himself-ly y'rs - tim
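Tim's closing questions got straightforward answers in the design that eventually shipped, shown here with modern spellings for illustration (`unichr` was the Python 1.6/2.x name for what is `chr` below):

```python
# ord() accepts a one-character Unicode string and returns its code point;
# chr() (historically unichr()) is its inverse.
assert ord(u"\u20ac") == 0x20AC           # EURO SIGN
assert chr(0x20AC) == u"\u20ac"           # the inverse of ord()
assert chr(ord(u"\u1234")) == u"\u1234"   # round trip
```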

Tim Peters wrote:
unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...)
good idea. and for some reason, patches for this are included in the unicode distribution (see the attached str2utf.c).
I vote for 'outlaw'. </F>

    /* A small code snippet that translates \uxxxx syntax to UTF-8 text.
       To be cut and pasted into Python/compile.c */

    /* Written by Fredrik Lundh, January 1999. */

    /* Documentation (for the language reference):

       \uxxxx -- Unicode character with hexadecimal value xxxx. The
       character is stored using UTF-8 encoding, which means that this
       sequence can result in up to three encoded characters.

       Note that the 'u' must be followed by four hexadecimal digits. If
       fewer digits are given, the sequence is left in the resulting
       string exactly as given. If more digits are given, only the first
       four are translated to Unicode, and the remaining digits are left
       in the resulting string. */

    #define Py_CHARMASK(ch) ch

    void convert(const char *s, char *p)
    {
        while (*s) {
            if (*s != '\\') {
                *p++ = *s++;
                continue;
            }
            s++;
            switch (*s++) {

            /* ---------------------------------------------------------- */
            /* copy this section to the appropriate place in compile.c... */

            case 'u':
                /* \uxxxx => UTF-8 encoded unicode character */
                if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                    isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                    /* fetch hexadecimal character value */
                    unsigned int n, ch = 0;
                    for (n = 0; n < 4; n++) {
                        int c = Py_CHARMASK(*s);
                        s++;
                        ch = (ch << 4) & ~0xF;
                        if (isdigit(c))
                            ch += c - '0';
                        else if (islower(c))
                            ch += 10 + c - 'a';
                        else
                            ch += 10 + c - 'A';
                    }
                    /* store as UTF-8 */
                    if (ch < 0x80)
                        *p++ = (char) ch;
                    else {
                        if (ch < 0x800) {
                            *p++ = 0xc0 | (ch >> 6);
                            *p++ = 0x80 | (ch & 0x3f);
                        } else {
                            *p++ = 0xe0 | (ch >> 12);
                            *p++ = 0x80 | ((ch >> 6) & 0x3f);
                            *p++ = 0x80 | (ch & 0x3f);
                        }
                    }
                    break;
                } else
                    goto bogus;

            /* ---------------------------------------------------------- */

            default:
            bogus:
                *p++ = '\\';
                *p++ = s[-1];
                break;
            }
        }
        *p++ = '\0';
    }

    main()
    {
        int i;
        unsigned char buffer[100];
        convert("Link\\u00f6ping", buffer);
        for (i = 0; buffer[i]; i++)
            if (buffer[i] < 0x20 || buffer[i] >= 0x80)
                printf("\\%03o", buffer[i]);
            else
                printf("%c", buffer[i]);
    }

[/F, dripping with code]
Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards]
The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).

Tim Peters wrote:
I second that.
No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates).
Good points. The conversion goes as follows:

- for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as a Unicode ordinal;
- for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead.
Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-)
BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).
Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard: u = unicode('...string with UTF-8 encoded characters...','utf-8') -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Perfect! [about "raw" Unicode strings]
Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventually become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever.
That's what I figured, and thanks for the confirmation.

Tim Peters wrote:
Thanks :-)
Agreed... note that you could also write your own codec for just this reason and then use:

    u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

    u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.
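MAL's suggested helper can be sketched with the codec name Python ended up shipping, `raw_unicode_escape`; the `ur` function itself is hypothetical, shown only to make the idea concrete:

```python
def ur(s):
    # Hypothetical helper in the spirit of MAL's suggestion: decode only
    # the \uXXXX escapes, leaving every other backslash sequence alone.
    # 'raw_unicode_escape' is the codec name Python eventually adopted.
    return s.encode("latin-1").decode("raw_unicode_escape")

assert ur(r"...\u4545...") == u"...\u4545..."
assert ur(r"\n") == u"\\n"   # other escapes pass through untouched
```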
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on raw Unicode strings]
Well, not quite. In general you need to pass raw strings:

    u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
                ^
    u = ur(r'...\u4545...\...\...')
           ^

else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion).
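Tim's \b example, made concrete (illustrative, using the standard re module):

```python
import re

# Without the raw prefix, Python's literal processing consumes the
# backslash before the regexp engine ever sees it.
assert len("\b") == 1    # one character: backspace (U+0008)
assert len(r"\b") == 2   # two characters: backslash + 'b'

# As a pattern, r"\b" is a word-boundary assertion...
assert re.search(r"\bword\b", "a word here") is not None
# ...while "\b" is a literal backspace character, which never matches here.
assert re.search("\b", "a word here") is None
```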

Tim Peters <tim_one@email.msn.com> wrote:
(\b is supposed to be a word boundary assertion).
in some places, that is. </F> Main Entry: reg·u·lar Pronunciation: 're-gy&-l&r, 're-g(&-)l&r 1 : belonging to a religious order 2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ... 3 a : ORDERLY, METHODICAL <regular habits> ... 4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ...

Tim Peters wrote:
Right. Here is a sample implementation of what I had in mind:

    """ Demo for 'unicode-escape' encoding. """

    import struct, string, re

    pack_format = '>H'

    def convert_string(s):
        l = map(None, s)
        for i in range(len(l)):
            l[i] = struct.pack(pack_format, ord(l[i]))
        return l

    u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

    def unicode_unescape(s):
        l = []
        start = 0
        while start < len(s):
            m = u_escape.search(s, start)
            if not m:
                l[len(l):] = convert_string(s[start:])
                break
            m_start, m_end = m.span()
            if m_start > start:
                l[len(l):] = convert_string(s[start:m_start])
            hexcode = m.group(1)
            #print hexcode, start, m_start
            if len(hexcode) != 4:
                raise SyntaxError, 'illegal \\uXXXX sequence: \\u%s' % hexcode
            ordinal = string.atoi(hexcode, 16)
            l.append(struct.pack(pack_format, ordinal))
            start = m_end
        #print l
        return string.join(l, '')

    def hexstr(s, sep=''):
        return string.join(map(lambda x, hex=hex, ord=ord: '%02x' % ord(x), s), sep)

-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>:

    def unicode_unescape(s):
        from string import atoi
        import array
        i, n = 0, len(s)
        result = array.array('H')  # unsigned short, native order
        while i < n:
            ch = s[i]
            i = i+1
            if ch != "\\":
                result.append(ord(ch))
                continue
            if i == n:
                raise ValueError("string ends with lone backslash")
            ch = s[i]
            i = i+1
            if ch != "u":
                result.append(ord("\\"))
                result.append(ord(ch))
                continue
            hexchars = s[i:i+4]
            if len(hexchars) != 4:
                raise ValueError("\\u escape at end not followed by "
                                 "at least 4 characters")
            i = i+4
            for ch in hexchars:
                if ch not in "0123456789abcdefABCDEF":
                    raise ValueError("\\u" + hexchars + " contains "
                                     "non-hex characters")
            result.append(atoi(hexchars, 16))
        # print result
        return result.tostring()

Tim Peters wrote:
Right...
Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme, with the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061).
· all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.

Examples:

    u'abc'          -> U+0061 U+0062 U+0063
    u'\u1234'       -> U+1234
    u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

Now how should we define ur"abc\u1234\n" ... ?

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 44 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
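For reference, this scheme survives essentially unchanged into modern Python, where the u prefix on string literals is optional; a quick check (note the final escape decodes to the newline, U+000A):

```python
# Each literal decodes to a sequence of Unicode ordinals per the scheme above.
examples = {
    'abc': [0x61, 0x62, 0x63],
    '\u1234': [0x1234],
    'abc\u1234\n': [0x61, 0x62, 0x63, 0x1234, 0x000A],
}
for literal, ordinals in examples.items():
    assert [ord(c) for c in literal] == ordinals
print('all examples check out')
```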

[MAL]
Does that exclude ur"" strings? Not arguing either way, just don't know what all this means.
Same as before (scream if that's wrong).
· all existing defined Python escape sequences are interpreted as Unicode ordinals;
Same as before (ditto).
note that \xXXXX can represent all Unicode ordinals,
This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition?
and \OOO (octal) can represent Unicode ordinals up to U+01FF.
Same as before (ditto).
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.
Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes.
The last example is damaged (U+05c isn't legit). Other than that, these look the same as before.
Now how should we define ur"abc\u1234\n" ... ?
If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it).

Tim Peters wrote:
Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX).
The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan).
Corrected; thanks.
They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I did, but wasn't sure whether he was arguing for going the Java way...
So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/)

Would this definition be fine ?

    """
    u = ur'<raw-unicode-escape encoded Python string>'

    The 'raw-unicode-escape' encoding is defined as follows:

    · \uXXXX sequences represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd

    · all other characters represent themselves as Unicode ordinals (e.g. 'b' -> U+0062)
    """

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Yes. --Guido van Rossum (home page: http://www.python.org/~guido/)
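The rule just agreed on fits in a few lines of code; this is a hypothetical modern-Python rendering of the odd/even backslash counting (the function name is illustrative, not the actual codec):

```python
def raw_unicode_escape(s):
    """Sketch of the 'raw-unicode-escape' rule discussed above:
    \\uXXXX is expanded only when preceded by an odd number of
    backslashes; everything else passes through unchanged."""
    out = []
    i, n = 0, len(s)
    while i < n:
        if s[i] != '\\':
            out.append(s[i])
            i += 1
            continue
        # Count the run of consecutive backslashes.
        j = i
        while j < n and s[j] == '\\':
            j += 1
        nslash = j - i
        if nslash % 2 == 1 and s[j:j+1] == 'u':
            hexdigits = s[j+1:j+5]
            if len(hexdigits) != 4:
                raise ValueError(r'truncated \uXXXX escape')
            # Keep the paired backslashes, expand the final \u escape.
            out.append('\\' * (nslash - 1))
            out.append(chr(int(hexdigits, 16)))
            i = j + 5
        else:
            out.append('\\' * nslash)
            i = j
    return ''.join(out)
```

So `\u1234` becomes one character, `\\u1234` stays exactly as written, and `\\\u1234` becomes two backslashes plus a character, matching Guido's description.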

Tim Peters wrote:
It would be more consistent to use the Unicode ordinal (instead of interpreting the number as a UTF8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file or the Unicode book for that matter.
See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW.
BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.
Good points. How about

    uniord(u[:1]) --> Unicode ordinal number (32-bit)

    unichr(i) --> Unicode object for character i (provided it is 32-bit); ValueError otherwise

They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study). I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,

-- Marc-Andre Lemburg

______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on Unicode chr() and ord()]
Indeed I do not!
OTOH, ord() could probably be extended to also work on Unicode objects.
I think it should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string.

Tim Peters wrote:
Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
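As a footnote, this is exactly how it settled in the long run: ord() became polymorphic and chr() grew to cover the full Unicode range (unichr survived only in Python 2). A quick modern-Python check:

```python
# ord() accepts any 1-character string, whatever the ordinal...
assert ord('A') == 65
assert ord('\u03c0') == 0x03C0      # Greek pi, per MAL's example
# ...and chr() is its inverse across the whole Unicode range.
assert chr(0x1F600) == '\U0001F600'
assert ord(chr(0x10FFFF)) == 0x10FFFF
print('ord/chr round-trip ok')
```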

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F>

Fredrik Lundh wrote:
Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

you should read the unicode.h file a bit more carefully:

    ...
    /* Unicode declarations.  Tweak these to match your platform */

    /* set this flag if the platform has "wchar.h", "wctype.h" and the
       wchar_t type is a 16-bit unsigned type */
    #define HAVE_USABLE_WCHAR_H

    #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

(this uses wchar_t, and also iswspace and friends)

    ...
    #else

    /* Use if you have a standard ANSI compiler, without wchar_t support.
       If a short is not 16 bits on your platform, you have to fix the
       typedef below, or the module initialization code will complain. */

(this maps iswspace to isspace, for 8-bit characters).

    #endif
    ...

the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own unicode database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F>

Fredrik Lundh wrote:
Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16.
(iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive).
It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh writes:
I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes:

    Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object.

keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F>

Fredrik Lundh wrote:
It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

"SM" == Skip Montanaro <skip@mojam.com> writes:
SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry

Andy,

Thanks a bundle for your case study and your toolkit proposal. It's interesting that you haven't touched upon internationalization of user interfaces (dialog text, menus etc.) -- that's a whole nother can of worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do (under pressure from HP who want Python i18n badly and are willing to pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit. I hope to hear soon from anybody who disagrees with Marc-Andre's proposal, because without opposition this is going to be Python 1.6's offering for i18n... (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not sure why you couldn't convert everything to Unicode and be done with it. I have a feeling that the answer is somewhere in your case study -- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
The proposal seems reasonable to me.
(Together with a new Unicode regex engine by /F.)
This is good news! Would it be a from-scratch regex implementation, or would it be an adaptation of an existing engine? Would it involve modifications to the existing re module, or a completely new unicodere module? (If, unlike re.py, it has POSIX longest-match semantics, that would pretty much settle the question.) -- A.M. Kuchling http://starship.python.net/crew/amk/ All around me darkness gathers, fading is the sun that shone, we must speak of other matters, you can be me when I'm gone... -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"

[AMK]
The proposal seems reasonable to me.
Thanks. I really hope that this time we can move forward united...
It's from scratch, and I believe it's got Perl style, not POSIX style semantics -- per Tim Peters' recommendations. Do we need to open the discussion again? It involves a redone re module (supporting Unicode as well as 8-bit), but its API could be unchanged. /F does the parsing and compilation in Python, only the matching engine is in C -- not sure how that impacts performance, but I imagine with aggressive caching it would be okay. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum writes:
No, no; I'm actually happier with Perl-style, because it's far better documented and familiar to people. Worse *is* better, after all. My concern is simply that I've started translating re.py into C, and wonder how this affects the translation. This isn't a pressing issue, because the C version isn't finished yet.
Can I get my paws on a copy of the modified re.py to see what ramifications it has, or is this all still an unreleased work-in-progress? Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes. I would have liked to make it possible to generate PCRE bytecodes from Python, but what stopped me is the chance of bogus bytecode causing the engine to dump core, loop forever, or some other nastiness. (This is particularly important for code that uses rexec.py, because you'd expect regexes to be safe.) Fixing the engine to be stable when faced with bad bytecodes appears to require many additional checks that would slow down the common case of correct code, which is unappealing. -- A.M. Kuchling http://starship.python.net/crew/amk/ Anybody else on the list got an opinion? Should I change the language or not? -- Guido van Rossum, 28 Dec 91

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
I would concur with the preference for Perl-style semantics. Aside from the issue of consistency with other scripting languages, i think it's easier to predict the behaviour of these semantics. You can run the algorithm in your head, and try the backtracking yourself. It's good for the algorithm to be predictable and well understood.
Doing the compilation in Python is a good idea, and will make it possible to implement alternative syntaxes.
Also agree. I still have some vague wishes for a simpler, more readable (more Pythonian?) way to express patterns -- perhaps not as powerful as full regular expressions, but useful for many simpler cases (an 80-20 solution). -- ?!ng

"AMK" == Andrew M Kuchling <akuchlin@mems-exchange.org> writes:
AMK> No, no; I'm actually happier with Perl-style, because it's AMK> far better documented and familiar to people. Worse *is* AMK> better, after all. Plus, you can't change re's semantics and I think it makes sense if the Unicode engine is as close semantically as possible to the existing engine. We need to be careful not to worsen performance for 8bit strings. I think we're already on the edge of acceptability w.r.t. P*** and hopefully we can /improve/ performance here. MAL's proposal seems quite reasonable. It would be excellent to see these things done for Python 1.6. There's still some discussion on supporting internationalization of applications, e.g. using gettext but I think those are smaller in scope. -Barry

Barry A. Warsaw writes: (in relation to support for Unicode regexes)
I don't think that will be a problem, given that the Unicode engine would be a separate C implementation. A bit of 'if type(strg) == UnicodeType' in re.py isn't going to cost very much speed. (Speeding up PCRE -- that's another question. I'm often tempted to rewrite pcre_compile to generate an easier-to-analyse parse tree, instead of its current complicated-but-memory-parsimonious compiler, but I'm very reluctant to introduce a fork like that.) -- A.M. Kuchling http://starship.python.net/crew/amk/ The world does so well without me, that I am moved to wish that I could do equally well without the world. -- Robertson Davies, _The Diary of Samuel Marchbanks_

Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is). or maybe anyone has an extensive performance test suite for perlish regular expressions? (preferably based on how real people use regular expressions, not only on things that are known to be slow if not optimized) </F>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs otherwise?] Fredrik Lundh writes:
any special pattern constructs that are in need of performance improvements? (compared to Perl, that is).
In the 1.5 source tree, I think one major slowdown is coming from the malloc'ed failure stack. This was introduced in order to prevent an expression like (x)* from filling the stack when applied to a string containing 50,000 'x' characters (hence 50,000 recursive function calls). I'd like to get rid of this stack because it's slow and requires much tedious patching of the upstream PCRE.
Friedl's book describes several optimizations which aren't implemented in PCRE. The problem is that PCRE never builds a parse tree, and parse trees are easy to analyse recursively. Instead, PCRE's functions actually look at the compiled byte codes (for example, look at find_firstchar or is_anchored in pypcre.c), but this makes analysis functions hard to write, and rearranging the code near-impossible. -- A.M. Kuchling http://starship.python.net/crew/amk/ I didn't say it was my fault. I said it was my responsibility. I know the difference. -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"

[Andrew M. Kuchling]
This is wonderfully & ironically Pythonic. That is, the Python compiler itself goes straight to byte code, and the optimization that's done works at the latter low level. Luckily <wink>, very little optimization is attempted, and what's there only replaces one bytecode with another of the same length. If it tried to do more, it would have to rearrange the code ... the-more-things-differ-the-more-things-don't-ly y'rs - tim

(a copy was sent to comp.lang.python by mistake; sorry for that). Andrew M. Kuchling <akuchlin@mems-exchange.org> wrote:
a slightly hairer design issue is what combinations of pattern and string the new 're' will handle. the first two are obvious:

    ordinary pattern, ordinary string
    unicode pattern, unicode string

but what about these?

    ordinary pattern, unicode string
    unicode pattern, ordinary string

"coercing" patterns (i.e. recompiling, on demand) seem to be a somewhat risky business ;-) </F>
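For what it's worth, modern Python's re eventually answered this question by refusing to coerce at all; a small check of the behavior that shipped:

```python
import re

# Matching like with like works for both string flavors...
assert re.search(r'\w+', 'abc').group() == 'abc'
assert re.search(rb'\w+', b'abc').group() == b'abc'

# ...but mixing a text pattern with a bytes string (or vice versa)
# raises TypeError rather than recompiling the pattern on demand.
try:
    re.search(r'\w+', b'abc')
except TypeError:
    print('mixed pattern/string rejected')
```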

[Guido, on "a new Unicode regex engine by /F"]
No, but I get to whine just a little <wink>: I didn't recommend either approach. I asked many futile questions about HP's requirements, and sketched implications either way. If HP *has* a requirement wrt POSIX-vs-Perl, it would be good to find that out before it's too late. I personally prefer POSIX semantics -- but, as Andrew so eloquently said, worse is better here; all else being equal it's best to follow JPython's Perl-compatible re lead. last-time-i-ever-say-what-i-really-think<wink>-ly y'rs - tim

Mark Hammond wrote:
Well almost... it depends on the current value of <default encoding>. If it's UTF8 and you only use normal ASCII characters the above is indeed true, but UTF8 can go far beyond ASCII and have up to 3 bytes per character (for UCS2, even more for UCS4). With <default encoding> set to other exotic encodings this is likely to fail though.
"U" is meant to simplify checks for Unicode objects, much like "S". It returns a reference to the object. Auto-conversions are not possible due to this, because they would create new objects which don't get properly garbage collected later on. Another problem is that Unicode types differ between platforms (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit wchar_t). Depending on the internal format of Unicode objects this could mean calling different conversion APIs. BTW, I'm still not too sure about the underlying internal format. The problem here is that Unicode started out as 2-byte fixed length representation (UCS2) but then shifted towards a 4-byte fixed length reprensetation known as UCS4. Since having 4 bytes per character is hard sell to customers, UTF16 was created to stuff the UCS4 code points (this is how character entities are called in Unicode) into 2 bytes... with a variable length encoding. Some platforms that started early into the Unicode business such as the MS ones use UCS2 as wchar_t, while more recent ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the two are compatible (I would suspect the top bytes in UCS4 being 0 for UCS2 codes), but would like to hear whether it wouldn't be a better idea to use UTF16 as internal format. The latter works in 2 bytes for most characters and conversion to UCS2|4 should be fast. Still, conversion to UCS2 could fail. The downside of using UTF16: it is a variable length format, so iterations over it will be slower than for UCS4. Simply sticking to UCS2 is probably out of the question, since Unicode 3.0 requires UCS4 and we are targetting Unicode 3.0. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
... Well almost... it depends on the current value of <default encoding>.
Default encodings are kind of nasty when they can be altered. The same problem occurred with import hooks. Only one can be present at a time. This implies that modules, packages, subsystems, whatever, cannot set a default encoding because something else might depend on it having a different value. In the end, nobody uses the default encoding because it is unreliable, so you end up with extra implementation/semantics that aren't used/needed. Have you ever noticed how Python modules, packages, tools, etc, never define an import hook? I'll bet nobody ever monkeys with the default encoding either... I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
Exactly the reason to avoid wchar_t.
History is basically irrelevant. What is the situation today? What is in use, and what are people planning for right now?
Bzzt. May as well go with UTF-8 as the internal format, much like Perl is doing (as I recall). Why go with a variable length format, when people seem to be doing fine with UCS-2? Like I said in the other mail note: two large platforms out there are UCS-2 based. They seem to be doing quite well with that approach. If people truly need UCS-4, then they can work with that on their own. One of the major reasons for putting Unicode into Python is to increase/simplify its ability to speak to the underlying platform. Hey! Guess what? That generally means UCS2. If we didn't need to speak to the OS with these Unicode values, then people can work with the values entirely in Python, PyUnicodeType-be-damned. Are we digging a hole for ourselves? Maybe. But there are two other big platforms that have the same hole to dig out of *IF* it ever comes to that. I posit that it won't be necessary; that the people needing UCS-4 can do so entirely in Python. Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and vice-versa. But: it only does it from String to String -- you can't use Unicode objects anywhere in there.
Oh? Who says? Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote: [MAL:]
Ehm, pardon me for asking - what is the brief rationale for selecting UCS2/4, or whatever it ends up being, over UTF8? I couldn't find a discussion in the last months of the string SIG, was this decided upon and frozen long ago? I'm not trying to re-open a can of worms, just to understand. -- Jean-Claude

Jean-Claude Wippler wrote:
UCS-2 is the native format on major platforms (meaning straight fixed length encoding using 2 bytes), ie. interfacing between Python's Unicode object and the platform APIs will be simple and fast. UTF-8 is short for ASCII users, but imposes a performance hit for the CJK (Asian character sets) world, since UTF8 uses *variable* length encodings. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
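The CJK cost MAL mentions can be measured directly; a small illustration (using modern codecs for convenience):

```python
ascii_text = 'hello'
cjk_text = '\u65e5\u672c\u8a9e'   # "Japanese" written in kanji

# UTF-8: 1 byte per ASCII character, but 3 bytes per CJK character...
assert len(ascii_text.encode('utf-8')) == 5
assert len(cjk_text.encode('utf-8')) == 9
# ...while a fixed 2-byte format charges every character 2 bytes flat.
assert len(cjk_text.encode('utf-16-be')) == 6
print('utf-8 vs 2-byte:', len(cjk_text.encode('utf-8')),
      len(cjk_text.encode('utf-16-be')))
```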

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
Try sometime last year :-) ... something like July thru September as I recall. Things will be a lot faster if we have a fixed-size character. Variable length formats like UTF-8 are a lot harder to slice, search, etc. Also, (IMO) a big reason for this new type is for interaction with the underlying OS/platform. I don't know of any platforms right now that really use UTF-8 as their Unicode string representation (meaning we'd have to convert back/forth from our UTF-8 representation to talk to the OS). Cheers, -g -- Greg Stein, http://www.lyra.org/

[ Greg Stein]
The initial byte of any UTF-8 encoded character never appears in a *non*-initial position of any UTF-8 encoded character. Which means searching is not only tractable in UTF-8, but also that whatever optimized 8-bit clean string searching routines you happen to have sitting around today can be used as-is on UTF-8 encoded strings. This is not true of UCS-2 encoded strings (in which "the first" byte is not distinguished, so 8-bit search is vulnerable to finding a hit starting "in the middle" of a character). More, to the extent that the bulk of your text is plain ASCII, the UTF-8 search will run much faster than when using a 2-byte encoding, simply because it has half as many bytes to chew over. UTF-8 is certainly slower for random-access indexing, including slicing. I don't know what "etc" means, but if it follows the pattern so far, sometimes it's faster and sometimes it's slower <wink>.
No argument here.
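Tim's self-synchronization point can be demonstrated concretely; a sketch contrasting a naive byte-level search over UTF-8 with one over a 2-byte encoding:

```python
# UTF-8 lead bytes never appear in continuation position, so a plain
# byte search cannot start matching in the middle of a character.
text = 'héllo wörld'.encode('utf-8')
assert b'w\xc3\xb6rld' in text        # finds 'wörld' safely

# In UCS-2/UTF-16, nothing distinguishes a "first" byte: the bytes of
# 'A' in big-endian UCS-2 (00 41) show up here straddling two adjacent
# characters, producing a false hit.
ucs2 = '\u4100\u4142'.encode('utf-16-be')   # bytes: 41 00 41 42
assert b'\x00\x41' in ucs2                  # hit starting mid-character
```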

Greg Stein <gstein@lyra.org> wrote:
Have you ever noticed how Python modules, packages, tools, etc, never define an import hook?
hey, didn't MAL use one in one of his mx kits? ;-)
I say axe it and say "UTF-8" is the fixed, default encoding. If you want something else, then do that explicitly.
exactly. modes are evil. python is not perl. etc.
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed? </F>

Fredrik Lundh wrote:
Not yet, but I will unless my last patch ("walk me up, Scotty" - import) goes into the core interpreter.
But a requirement by the customer... they want to be able to set the locale on a per thread basis. Not exactly my preference (I think all locale settings should be passed as parameters, not via globals).
No, but people are already thinking about it and there is a defined range in the >16-bit area for private encodings (F0000..FFFFD and 100000..10FFFD). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Marc writes:
Sure - that is what this customer wants, but we need to be clear about the "best thing" for Python generally versus what this particular client wants. For example, if we went with UTF-8 as the only default encoding, then HP may be forced to use a helper function to perform the conversion, rather than the built-in functions. This helper function can use TLS (in Python) to store the encoding. At least it is localized. I agree that having a default encoding that can be changed is a bad idea. It may make 3 line scripts that need to print something easier to work with, but at the cost of reliability in large systems. Kinda like the existing "locale" support, which is thread specific, and is well known to cause these sorts of problems. The end result is that in your app, you find _someone_ has changed the default encoding, and some code no longer works. So the solution is to change the default encoding back, so _your_ code works again. You just know that whoever it was that changed the default encoding in the first place is now going to break - but what else can you do? Having a fixed, default encoding may make life slightly more difficult when you want to work primarily in a different encoding, but at least your system is predictable and reliable. Mark.

Tim Peters wrote:
See my other post on the subject... Note that if we make UTF-8 the standard encoding, nearly all special Latin-1 characters will produce UTF-8 errors on input and unreadable garbage on output. That will probably be unacceptable in Europe. To remedy this, one would *always* have to use u.encode('latin-1') to get readable output for Latin-1 strings represented in Unicode. I'd rather see this happen the other way around: *always* explicitly state the encoding you want in case you rely on it, e.g. write file.write(u.encode('utf-8')) instead of file.write(u) # let's hope this goes out as UTF-8... Using the <default encoding> as a site-dependent setting is useful for convenience in those cases where the output format should be readable rather than parseable. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
I think it's time for the Europeans to pronounce on what's acceptable in Europe. To the limited extent that I can pretend I'm European, I'm happy with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.
By the same argument, those pesky Europeans who are relying on Latin-1 should write file.write(u.encode('latin-1')) instead of file.write(u) # let's hope this goes out as Latin-1
Well, "convenience" is always the argument advanced in favor of modes. Conflicts and nasty intermittent bugs are always the result. The latter will happen under Guido's idea too, as various careless modules rebind stdin & stdout to their own ideas of what "the proper" encoding should be. But at least the blame doesn't fall on the core language then <0.3 wink>. Since there doesn't appear to be anything (either good or bad) you can do (or avoid) by using Guido's scheme instead of magical core thread state, there's no *need* for the latter. That is, it can be done with a user-level API without involving the core.

Tim Peters wrote:
Agreed.
Right.
Ditto :-) I have nothing against telling people to take care of the problem in user space (meaning: not done by the core interpreter) and I'm pretty sure that HP will agree on this too, provided we give them the proper user space tools like file wrappers et al. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Mark Hammond wrote:
I think the discussion on this is getting a little too hot. The point is simply that the option of changing the per-thread default encoding is there. You are not required to use it and if you do you are on your own when something breaks. Think of it as a HP specific feature... perhaps I should wrap the code in #ifdefs and leave it undocumented. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Really - I see it as moving to a rational consensus that doesn't support the proposal in this regard. I see no heat in it at all. I'm sorry if you saw my post or any of the followups as "emotional", but I'm certainly not getting passionate about this. I don't see any of this as affecting me personally. I believe that I can replace my Unicode implementation with this either way we go. Just because we are trying to get it right doesn't mean we are getting heated.
Hrm - I'm having serious trouble following your logic here. If I make _any_ assumptions about a default encoding, I am in danger of breaking. I may not choose to change the default, but as soon as _anyone_ does, unrelated code may break. I agree that I will be "on my own", but I won't necessarily have been the one that changed it :-( The only answer I can see is, as you suggest, to ignore the fact that there is _any_ default. Always specify the encoding. But obviously this is not good enough for HP:
That would work - just ensure that no standard Python has those #ifdefs turned on :-) I would be sorely disappointed if the fact that HP are throwing money at this means they get every whim implemented in the core language. Imagine the outcry if it were instead MS' money, and you were attempting to put an MS spin on all this. Are you writing a module for HP, or writing a module for Python that HP are assisting by providing some funding? Clear difference. IMO, it must also be seen that there is a clear difference. Maybe I'm missing something. Can you explain why it is good enough for everyone else to be required to assume there is no default encoding, but HP get their thread-specific global? Are their requirements greater than anyone else's? Is everyone else not as important? What would you, as a consultant, recommend to people who aren't HP, but have a similar requirement? It would seem obvious to me that HP's requirement can be met in "pure Python", thereby keeping this out of the core altogether... Mark.

[per-thread defaults] C'mon guys, hasn't anyone ever played consultant before? The idea is obviously brain-dead. OTOH, they asked for it specifically, meaning they have some assumptions about how they think they're going to use it. If you give them what they ask for, you'll only have to fix it when they realize there are other ways of doing things that don't work with per-thread defaults. So, you find out why they think it's a good thing; you make it easy for them to code this way (without actually using per-thread defaults) and you don't make a fuss about it. More than likely, they won't either. "requirements"-are-only-useful-as-clues-to-the-objectives- behind-them-ly y'rs - Gordon

Mark Hammond wrote:
Naa... with "heated" I meant the "HP wants this, HP wants that" side of things. We'll just have to wait for their answer on this one.
Sure there are some very subtle dangers in setting the default to anything other than the default ;-) For some this risk may be worth taking, for others not. In fact, in large projects I would never take such a risk... I'm sure we can get this message across to them.
Again, all I can try is convince them of not really needing settable default encodings. <IMO> Since this is the first time a Python Consortium member is pushing development, I think we can learn a lot here. For one, it should be clear that money doesn't buy everything, OTOH, we cannot put the whole thing at risk just because of some minor disagreement that cannot be solved between the parties. The standard solution for the latter should be a customized Python interpreter. </IMO> -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
hehe... funny you mention this. Go read the Consortium docs. Last time that I read them, there are no "parties" to reach consensus. *Every* technical decision regarding the Python language falls to the Technical Director (Guido, of course). I looked. I found nothing that can override the T.D.'s decisions and no way to force a particular decision. Guido is still the Benevolent Dictator :-) Cheers, -g p.s. yes, there is always the caveat that "sure, Guido has final say" but "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's title does have the word Benevolent in it, so things are cool... -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Sure, but have you considered the option of a member simply bailing out ? HP could always stop funding Unicode integration. That wouldn't help us either...
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
I'm not that dumb... come on. That was my whole point about "Benevolent" below... Guido is a fair and reasonable Dictator... he wouldn't let that happen.
Cheers, -g -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
It's a lot easier to just never provide the rope (per-thread default encodings) in the first place. If the feature exists, then it will be used. Period. Try to get the message across until you're blue in the face, but it would be used. Anyhow... discussion is pretty moot until somebody can state that it is/isn't a "real requirement" and/or until The Guido takes a position. Cheers, -g -- Greg Stein, http://www.lyra.org/

On Thu, 11 Nov 1999, Mark Hammond wrote:
Ha! I was getting ready to say exactly the same thing. Are we building Python for a particular customer, or are we building it to Do The Right Thing? I've been getting increasingly annoyed at "well, HP says this" or "HP wants that." I'm ecstatic that they are a Consortium member and are helping to fund the development of Python. However, if that means we are selling Python's soul to corporate wishes rather than programming and design ideals... well, it reduces my enthusiasm :-)
Yes! Yes! Example #2. My first example (import hooks) was shrugged off by some as "well, nobody uses those." Okay, maybe people don't use them (but I believe that is *because* of this kind of problem). In Mark's example, however... this is a definite problem. I ran into this when I was building some code for Microsoft Site Server. IIS was setting a different locale on my thread -- one that I definitely was not expecting. All of a sudden, strlwr() no longer worked as I expected -- certain characters didn't get lower-cased, so my dictionary lookups failed because the keys were not all lower-cased. Solution? Before passing control from C++ into Python, I set the locale to the default locale. Restored it on the way back out. Extreme measures, and costly to do, but it had to be done. I think I'll pick up Fredrik's phrase here... (chanting) "Modes Are Evil!" "Modes Are Evil!" "Down with Modes!" :-)
*bing* I'm with Mark on this one. Global modes and state are a serious pain when it comes to developing a system. Python is very amenable to utility functions and classes. Any "customer" can use a utility function to manually do the encoding according to a per-thread setting stashed in some module-global dictionary (map thread-id to default-encoding). Done. Keep it out of the interpreter... Cheers, -g -- Greg Stein, http://www.lyra.org/
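Greg's user-space alternative can be sketched in a few lines of modern Python (the helper names are invented for illustration, and `threading.get_ident` is today's spelling of the thread-id lookup, not the 1999 API):

```python
# Keep per-thread default encodings in an ordinary module-global
# dictionary, entirely outside the interpreter core.
import threading

_thread_encodings = {}  # thread id -> default encoding name

def set_thread_encoding(name):
    _thread_encodings[threading.get_ident()] = name

def encode_default(u):
    # Fall back to a fixed default when the thread never set one.
    enc = _thread_encodings.get(threading.get_ident(), "utf-8")
    return u.encode(enc)
```

Any "customer" wanting thread-specific behavior calls `set_thread_encoding` once per thread; everyone else never sees a mode.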

On Thu, 11 Nov 1999, Greg Stein wrote:
What about just explaining the rationale for the default-less point of view to whoever is in charge of this at HP and see why they came up with their rationale in the first place? They might have a good reason, or they might be willing to change said requirement. --david

Damn, you're smooth... maybe you should have run for SF Mayor... :-) On Wed, 10 Nov 1999, David Ascher wrote:
-- Greg Stein, http://www.lyra.org/

[/F]
last time I checked, there were no characters (even in the ISO standard) outside the 16-bit range. has that changed?
[MAL]
Over the decades I've developed a rule of thumb that has never wound up stuck in my ass <wink>: If I engineer code that I expect to be in use for N years, I make damn sure that every internal limit is at least 10x larger than the largest I can conceive of a user making reasonable use of at the end of those N years. The invariable result is that the N years pass, and fewer than half of the users have bumped into the limit <0.5 wink>. At the risk of offending everyone, I'll suggest that, qualitatively speaking, Unicode is as Eurocentric as ASCII is Anglocentric. We've just replaced "256 characters?! We'll *never* run out of those!" with 64K. But when Asian languages consume them 7K at a pop, 64K isn't even in my 10x comfort range for some individual languages. In just a few months, Unicode 3 will already have used up > 56K of the 64K slots. As I understand it, UTF-16 "only" adds 1M new code points. That's in my 10x zone, for about a decade. predicting-we'll-live-to-regret-it-either-way-ly y'rs - tim
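For reference, the arithmetic behind Tim's "1M new code points" figure (this check is mine, not from the thread): UTF-16 surrogate pairs combine one of 1024 high surrogates with one of 1024 low surrogates.

```python
# Surrogate ranges defined by the UTF-16 scheme.
high = 0xDBFF - 0xD800 + 1   # 1024 high (leading) surrogates
low  = 0xDFFF - 0xDC00 + 1   # 1024 low (trailing) surrogates
assert high * low == 1_048_576  # ~1M supplementary code points
```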

Tim Peters wrote:
If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and signal failure of this assertion at Unicode object construction time via an exception. That way we are within the standard, can use reasonably fast code for Unicode manipulation and add those extra 1M characters at a later stage. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
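A minimal sketch (mine, not MAL's actual patch) of the construction-time check he proposes: accept anything in the UCS-2 range, raise for anything that would need a surrogate pair.

```python
# Reject characters outside the Basic Multilingual Plane at
# construction time, per the "UTF-16 as if it were UCS-2" strategy.
def check_ucs2(text):
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character outside the UCS-2 range: %r" % ch)
    return text
```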

[MAL]
I think this is reasonable. Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is). Indexing UTF-8 strings is greatly speeded by adding a simple finger (i.e., store along with the string an index+offset pair identifying the most recent position indexed to -- since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point because "the first byte" of each encoding is recognizable as such). I expect either would work well. It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*? I don't. The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>. It's not obvious to me, but then neither do I claim that UTF-8 is obviously better.
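Tim's "finger" can be sketched as a toy class in modern Python (this illustration is mine; the class and method names are invented). It caches the byte offset of the most recently indexed character, so sequential indexing is amortized constant-time, and it scans backward as well as forward because UTF-8 start bytes are self-identifying.

```python
class Utf8Finger:
    """Index a UTF-8 buffer by character, with an index+offset finger."""

    def __init__(self, data: bytes):
        self.data = data
        self.char_pos = 0   # character index the finger points at
        self.byte_pos = 0   # byte offset of that character

    def _is_start(self, b):
        # Continuation bytes look like 10xxxxxx; everything else starts
        # a character, so scanning works in either direction.
        return (b & 0xC0) != 0x80

    def char_at(self, index):
        # Walk the finger one character at a time toward `index`.
        while self.char_pos < index:
            self.byte_pos += 1
            while (self.byte_pos < len(self.data)
                   and not self._is_start(self.data[self.byte_pos])):
                self.byte_pos += 1
            self.char_pos += 1
        while self.char_pos > index:
            self.byte_pos -= 1
            while not self._is_start(self.data[self.byte_pos]):
                self.byte_pos -= 1
            self.char_pos -= 1
        end = self.byte_pos + 1
        while end < len(self.data) and not self._is_start(self.data[end]):
            end += 1
        return self.data[self.byte_pos:end].decode("utf-8")
```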

Tim Peters wrote:
Here are some arguments for using the proposed UTF-16 strategy instead:

- all characters have the same length; indexing is fast
- conversion APIs to platform-dependent wchar_t implementations are fast because they can either simply copy the content or widen the 2 bytes to 4 bytes
- UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u with two dots) which are used in many non-English languages
- from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the <default encoding> representation of the object, which, if all goes well, will always hold the UTF-8 value. RE engines etc. can then directly work with this buffer.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
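The space trade-off MAL cites is easy to verify in today's Python (this check is not from the thread): UTF-8 is half the size of UTF-16 for ASCII, and a Latin-1 character like ü costs 2 bytes in both.

```python
# Encoded sizes of ASCII vs. Latin-1 sample text.
s_ascii = "hello"
s_latin = "über"   # one non-ASCII Latin-1 character

assert len(s_ascii.encode("utf-8")) == 5       # 1 byte per char
assert len(s_ascii.encode("utf-16-be")) == 10  # always 2 bytes per char
assert len(s_latin.encode("utf-8")) == 5       # "ü" costs 2 bytes
assert len(s_latin.encode("utf-16-be")) == 8
```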

<rant> over my dead body, that one... (fwiw, over the last 20 years, I've implemented about a dozen image processing libraries, supporting loads of pixel layouts and file formats. one important lesson from that is to stick to a single internal representation, and let the application programmers build their own layers if they need to speed things up -- yes, they're actually happier that way. and text strings are not that different from pixel buffers or sound streams or scientific data sets, after all...) (and sticks and modes will break your bones, but you know that...)
RE engines etc. can then directly work with this buffer.
sidebar: the RE engine that's being developed for this project can handle 8-bit, 16-bit, and (optionally) 32-bit text buffers. a single compiled expression can be used with any character size, and performance is about the same for all sizes (at least on any decent cpu).
(hey, I'm not a microsofter. but I've been writing "i/o libraries" for various "object types" all my life, so I do have strong preferences on what works, and what doesn't... I use Python for good reasons, you know ;-) </rant> thanks. I feel better now. </F>

Fredrik Lundh wrote:
Such a buffer is needed to implement "s" and "s#" argument parsing. It's a simple requirement to support those two parsing markers -- there's not much to argue about, really... unless, of course, you want to give up Unicode object support for all APIs using these parsers. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh wrote:
If we don't add that support, lots of existing APIs won't accept Unicode objects instead of strings. While it could be argued that automatic conversion to UTF-8 is not transparent enough for the user, the other solution of using str(u) everywhere would probably make writing Unicode-aware code a rather clumsy task and introduce other pitfalls, since str(obj) calls PyObject_Str() which also works on integers, floats, etc. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
No no no... "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are supposed to return the raw bytes. If a caller wants 8-bit characters, then that caller will use "t#". If you want to argue for that separate, encoded buffer, then argue for it for support for the "t#" format. But do NOT say that it is needed for "s#" which simply means "give me some bytes." -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:

First, we have a general design question here: should old code become Unicode compatible or not. As I recall, the original idea about Unicode integration was to follow Perl's idea to have scripts become Unicode aware by simply adding a 'use utf8;'. If this is still the case, then we'll have to come up with a reasonable approach for integrating classical string based APIs with the new type.

Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g. the Latin-1 folks) which has some very nice features (see http://czyborra.com/utf/ ) and which is a true extension of ASCII, this encoding seems best fit for the purpose.

However, one should not forget that UTF-8 is in fact a variable length encoding of Unicode characters, that is, up to 3 bytes form a *single* character. This is obviously not compatible with definitions that explicitly state data to be using an 8-bit single character encoding, e.g. indexing in UTF-8 doesn't work like it does in Latin-1 text.

So if we are to do the integration, we'll have to choose argument parser markers that allow for multi-byte characters. "t#" does not fall into this category, "s#" certainly does, "s" is arguable.

Also note that we have to watch out for embedded NULL bytes. UTF-16 has NULL bytes for every character from the Latin-1 domain. If "s" were to give back a pointer to the internal buffer which is encoded in UTF-16, you would lose data. UTF-8 doesn't have this problem, since only NULL bytes map to (single) NULL bytes.

Now Greg would chime in with the buffer interface and argue that it should make the underlying internal format accessible. This is a bad idea, IMHO, since you shouldn't really have to know what the internal data format is.
Defining "s#" to return UTF-8 data does not only make "s" and "s#" return the same data format (which should always be the case, IMO), but also hides the internal format from the user and gives him a reliable cross-platform data representation of Unicode data (note that UTF-8 doesn't have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#" do: they return pointers into data areas which have to be kept alive until the corresponding object dies. The only way to support this feature is by allocating a buffer for just this purpose (on the fly and only if needed to prevent excessive memory load). The other options of adding new magic parser markers or switching to a more generic one all have one downside: you need to change existing code, which is in conflict with the idea we started out with.

So, again, the question is: do we want this magical integration or not ? Note that this is a design question, not one of memory consumption...

-- Ok, the above covered Unicode -> String conversion. Mark mentioned that he wanted the other way around to also work in the same fashion, ie. automatic String -> Unicode conversion. This could also be done in the same way by interpreting the string as UTF-8 encoded Unicode... but we have the same problem: where to put the data without generating new intermediate objects. Since only newly written code will use this feature there is a way to do this though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do, if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new reference if obj is a Unicode object or create a new Unicode object by interpreting str(obj) as a UTF-8 encoded string.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 48 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
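MAL's embedded-NULL point is easy to demonstrate in today's Python (the example is mine, not from the thread): UTF-16 puts a zero byte in every Latin-1 character, so handing such a buffer to NUL-terminated C string code would truncate it, while UTF-8 only produces a zero byte for U+0000 itself.

```python
# UTF-16 embeds NUL bytes in ordinary text; UTF-8 does not.
assert b"\x00" in "A".encode("utf-16-be")        # b'\x00A'
assert b"\x00" not in "Ahoj, naïve".encode("utf-8")
assert "\x00".encode("utf-8") == b"\x00"         # only NUL maps to NUL
```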

I think I have a reasonable grasp of the issues here, even though I still haven't read about 100 msgs in this thread. Note that t# and the charbuffer addition to the buffer API were added by Greg Stein with my support; I'll attempt to reconstruct our thinking at the time... [MAL]
Let me summarize a bit on the general ideas behind "s", "s#" and the extra buffer:
I think you left out t#.
I've never heard of this idea before -- or am I taking it too literal? It smells of a mode to me :-) I'd rather live in a world where Unicode just works as long as you use u'...' literals or whatever convention we decide.
Yes, especially if we fix the default encoding as UTF-8. (I'm expecting feedback from HP on this next week; hopefully when I see the details, it'll be clear that they don't need a per-thread default encoding to solve their problems; that's quite a likely outcome. If not, we have a real-world argument for allowing a variable default encoding, without carnage.)
Sure, but where in current Python are there such requirements?
I disagree. I grepped through the source for s# and t#. Here's a bit of background. Before t# was introduced, s# was being used for two distinct purposes: (1) to get an 8-bit text string plus its length, in situations where the length was needed; (2) to get binary data (e.g. GIF data read from a file in "rb" mode). Greg pointed out that if we ever introduced some form of Unicode support, these two had to be disambiguated. We found that the majority of uses was for (2)! Therefore we decided to change the definition of s# to mean only (2), and introduced t# to mean (1). Also, we introduced getcharbuffer corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as before, it means you need an 8-bit text string not containing null bytes.

Our expectation was that a Unicode string passed to an s# situation would give a pointer to the internal format plus a byte count (not a character count!) while t# would get a pointer to some kind of 8-bit translation/encoding plus a byte count, with the explicit requirement that the 8-bit translation would have the same lifetime as the original unicode object. We decided to leave it up to the next generation (i.e., Marc-Andre :-) to decide what kind of translation to use and what to do when there is no reasonable translation.

Any of the following choices is acceptable (from the point of view of not breaking the intended t# semantics; we can now start deciding which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of Unicode, untranslatable characters could be dealt with in any number of ways (raise an exception, ignore, replace with '?', make best effort, etc.). Given the current context, it should probably be the same as the default encoding -- i.e., utf-8.
If we end up making the default user-settable, we'll have to decide what to do with untranslatable characters -- but that will probably be decided by the user too (it would be a property of a specific translation specification). In any case, I feel that t# could receive a multi-byte encoding, s# should receive raw binary data, and they should correspond to getcharbuffer and getreadbuffer, respectively. (Aside: the symmetry between 's' and 's#' is now lost; 's' matches 't#', there's no match for 's#'.)
This is a red herring given my explanation above.
This is for C code. Quite likely it *does* know what the internal data format is!
That was before t# was introduced. No more, alas. If you replace s# with t#, I agree with you completely.
(and t#, which is more relevant here)
Agreed. I think this was our thinking when Greg & I introduced t#. My own preference would be to allocate a whole string object, not just a buffer; this could then also be used for the .encode() method using the default encoding.
Yes, I want it. Note that this doesn't guarantee that all old extensions will work flawlessly when passed Unicode objects; but I think that it covers most cases where you could have a reasonable expectation that it works. (Hm, unfortunately many reasonable expectations seem to involve the current user's preferred encoding. :-( )
No! That is supposed to give the native representation of the string object. I agree that Mark's problem requires a solution too, but it doesn't have to use existing formatting characters, since there's no backwards compatibility issue.
This might work. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
On purpose -- according to my thinking. I see "t#" as an interface to bf_getcharbuf which I understand as 8-bit character buffer... UTF-8 is a multi byte encoding. It still is character data, but not necessarily 8 bits in length (up to 24 bits are used). Anyway, I'm not really interested in having an argument about this. If you say, "t#" fits the purpose, then that's fine with me. Still, we should clearly define that "t#" returns text data and "s#" binary data. Encoding, bit length, etc. should explicitly remain left undefined.
Fair enough :-)
It was my understanding that "t#" refers to single byte character data. That's where the above arguments were aiming at...
I know it's too late now, but I can't really follow the arguments here: in what ways are (1) and (2) different from the implementation's point of view ? If "t#" is to return UTF-8 then <length of the buffer> will not equal <text length>, so both parser markers return essentially the same information. The only difference would be on the semantic side: (1) means: give me text data, while (2) does not specify the data type. Perhaps I'm missing something...
This definition should then be changed to "text string without null bytes" dropping the 8-bit reference.
Hmm, I would strongly object to making "s#" return the internal format. file.write() would then default to writing UTF-16 data instead of UTF-8 data. This could result in strange errors due to the UTF-16 format being endian dependent. It would also break the symmetry between file.write(u) and unicode(file.read()), since the default encoding is not used as internal format for other reasons (see proposal).
I think we have already agreed on using UTF-8 for the default encoding. It has quite a few advantages. See http://czyborra.com/utf/ for a good overview of the pros and cons.
The usual Python way would be: raise an exception. This is what the proposal defines for Codecs in case an encoding/decoding mapping is not possible, BTW. (UTF-8 will always succeed on output.)
Why would you want to have "s#" return the raw binary data for Unicode objects ? Note that it is not mentioned anywhere that "s#" and "t#" do have to necessarily return different things (binary being a superset of text). I'd opt for "s#" and "t#" both returning UTF-8 data. This can be implemented by delegating the buffer slots to the <defencstr> object (see below).
C code can use the PyUnicode_* APIs to access the data. I don't think that argument parsing is powerful enough to provide the C code with enough information about the data contents, e.g. it can only state the encoding length, not the string length.
Done :-)
Good point. I'll change <defencbuf> to <defencstr>, a Python string object created on request.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 47 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Thanks for not picking an argument. Multibyte encodings typically have ASCII as a subset (in such a way that an ASCII string is represented as itself in bytes). This is the characteristic that's needed in my view.
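The ASCII-subset property Guido relies on can be spot-checked in today's Python (example mine, not from the thread): ASCII text encodes to the very same bytes in UTF-8 and in Shift-JIS, so 8-bit clean code passes it through untouched.

```python
# ASCII is a byte-for-byte subset of these multibyte encodings.
ascii_text = "plain ASCII text"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")
assert ascii_text.encode("shift-jis") == ascii_text.encode("ascii")
```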
t# refers to byte-encoded data. Multibyte encodings are explicitly designed to be passed cleanly through processing steps that handle single-byte character data, as long as they are 8-bit clean and don't do too much processing.
The idea is that (1)/s# disallows any translation of the data, while (2)/t# requires translation of the data to an ASCII superset (possibly multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data contains text and that if the text consists of only ASCII characters they are represented as themselves. (1)/s# makes no such assumption. In terms of implementation, Unicode objects should translate themselves to the default encoding for t# (if possible), but they should make the native representation available for s#. For example, take an encryption engine. While it is defined in terms of byte streams, there's no requirement that the bytes represent characters -- they could be the bytes of a GIF file, an MP3 file, or a gzipped tar file. If we pass Unicode to an encryption engine, we want Unicode to come out at the other end, not UTF-8. (If we had wanted to encrypt UTF-8, we should have fed it UTF-8.)
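Guido's encryption-engine example can be sketched with a stand-in XOR "cipher" (illustrative only, mine, and not a real cipher): a byte-oriented transformation must see the string's raw bytes and hand back the same bytes, whatever representation they happen to use.

```python
# A byte transformation that neither knows nor cares what the bytes
# mean -- GIF data, MP3 data, or a raw Unicode buffer.
def xor_cipher(data: bytes, key: int = 0x5A) -> bytes:
    return bytes(b ^ key for b in data)

raw = "Grüße".encode("utf-16-le")      # stand-in for a native buffer
assert xor_cipher(xor_cipher(raw)) == raw  # round-trips any bytes
```

Feed it the raw (s#-style) bytes and you get the same representation back out; feed it a UTF-8 translation and you have silently encrypted something else.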
Aha, I think there's a confusion about what "8-bit" means. For me, a multibyte encoding like UTF-8 is still 8-bit. Am I alone in this? (As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
But this was the whole design. file.write() needs to be changed to use s# when the file is open in binary mode and t# when the file is open in text mode.
If the file is encoded using UTF-16 or UCS-2, you should open it in binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the app should read the first 2 bytes, check for a BOM, and then choose between 'utf-16-be' and 'utf-16-le'.)
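The BOM check Guido suggests can be sketched as follows (the `sniff_utf16` helper is invented for the example; codec names are the ones that eventually shipped):

```python
# Peek at the first two bytes of a UTF-16 file, look for a byte-order
# mark, and pick the matching endian-specific codec name.
def sniff_utf16(raw: bytes) -> str:
    if raw[:2] == b"\xff\xfe":
        return "utf-16-le"
    if raw[:2] == b"\xfe\xff":
        return "utf-16-be"
    raise ValueError("no BOM found; encoding must be given explicitly")

data = "abc".encode("utf-16")           # writes a native-order BOM first
codec = sniff_utf16(data)
assert data[2:].decode(codec) == "abc"  # decode the payload after the BOM
```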
Of course. I was just presenting the list as an argument that if we changed our mind about the default encoding, t# should follow the default encoding (and not pick an encoding by other means).
Did you read Andy Robinson's case study? He suggested that for certain encodings there may be other things you can do that are more user-friendly than raising an exception, depending on the application. I am proposing to leave this a detail of each specific translation. There may even be translations that do the same thing except they have a different behavior for untranslatable cases -- e.g. a strict version that raises an exception and a non-strict version that replaces bad characters with '?'. I think this is one of the powers of having an extensible set of encodings.
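The strategies Guido lists map onto what later became codec error handlers; the sketch below uses today's names ('replace', 'strict'), and reading them as strategies 1 and 3 is my interpretation. Strategy 2 (intelligent recovery) is exactly what the extensible handler set leaves room for.

```python
# 1. Replace offending characters ('?' when encoding to a byte
#    encoding, U+FFFD when decoding):
assert "Link\u00f6ping".encode("ascii", "replace") == b"Link?ping"
assert b"Link\xffping".decode("ascii", "replace") == "Link\ufffdping"

# 3. Raise an exception (the strict flavor):
failed = False
try:
    b"Link\xffping".decode("ascii", "strict")
except UnicodeDecodeError:
    failed = True
assert failed
```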
Because file.write() for a binary file, and other similar things (e.g. the encryption engine example I mentioned above) must have *some* way to get at the raw bits.
This would defeat the whole purpose of introducing t#. We might as well drop t# then altogether if we adopt this.
Typically, all the C code does is pass multibyte encoded strings on to other library routines that know what to do to them, or simply give them back unchanged at a later time. It is essential to know the number of bytes, for memory allocation purposes. The number of characters is totally immaterial (and multibyte-handling code knows how to calculate the number of characters anyway).
--Guido van Rossum (home page: http://www.python.org/~guido/)
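Guido's memory-allocation point above can be checked in one small example: for buffer sizing the *byte* length is what matters, and multibyte-aware code can always recount characters.

```python
s = "Link\u00f6ping"                # 9 characters
encoded = s.encode("utf-8")         # 'ö' takes two bytes in UTF-8

assert len(s) == 9                  # character count
assert len(encoded) == 10           # bytes to allocate
assert len(encoded.decode("utf-8")) == 9   # characters, recomputed
```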

Guido van Rossum wrote:
Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not "8-bit clean" as you obviously did.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Ok, that would make the situation a little clearer (even though I expect the two different encodings to produce some FAQs). I still don't feel very comfortable about the fact that all existing APIs using "s#" will suddenly receive UTF-16 data if passed Unicode objects: this probably won't get us the "magical" Unicode integration we envision, since "t#" usage is not very widespread and character handling code will probably not work well with UTF-16 encoded strings. Anyway, we should probably try out both methods...
Right, that's the idea (there is a note on this in the Standard Codec section of the proposal).
Ok.
Agreed, the Codecs should decide for themselves what to do. I'll add a note to the next version of the proposal.
What for? Any lossless encoding should do the trick... UTF-8 is just as good as UTF-16 for binary files; plus it's more compact for ASCII data. I don't really see a need to get explicitly at the internal data representation, because both encodings are in fact "internal" with respect to Unicode objects. The only argument I can come up with is that using UTF-16 for binary files could (possibly) eliminate the UTF-8 conversion step which is otherwise always needed.
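MAL's two claims here are easy to check: both encodings are lossless, and UTF-8 is more compact for ASCII data (one byte per character versus two).

```python
# Compactness for ASCII data:
assert len("hello".encode("utf-8")) == 5
assert len("hello".encode("utf-16-le")) == 10

# Both are lossless, so either works for round-tripping through a
# binary file:
u = "Link\u00f6ping"
for codec in ("utf-8", "utf-16-le"):
    assert u.encode(codec).decode(codec) == u
```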
Well... yes ;-)
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 46 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
Hrm. That might be dangerous. Many of the functions that use "t#" assume that each character is 8 bits long, i.e. the returned length == the number of characters. I'm not sure what the implications would be if you interpret the semantics of "t#" as multi-byte characters.
Heck. I just want to quickly throw the data onto my disk. I'll write a BOM, followed by the raw data. Done. It's even portable.
Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t" format).
(As far as I know, C uses char* to represent multibyte characters.) Maybe we should disambiguate it more explicitly?
We can disambiguate with a new format character, or we can clarify the semantics of "t" to mean single- *or* multi- byte characters. Again, I think there may be trouble if the semantics of "t" are defined to allow multibyte characters.
There should be some definition for the two markers and the ideas behind them in the API guide, I guess.
Certainly. [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
Interesting idea, but that presumes that "t" will be defined for the Unicode object (i.e. it implements the getcharbuffer type slot). Because of the multi-byte problem, I don't think it will. [ not to mention, that I don't think the Unicode object should implicitly do a UTF-8 conversion and hold a ref to the resulting string ]
I'm not sure that we should definitely go for "magical." Perl has magic in it, and that is one of its worst faults. Go for clean and predictable, and leave as much logic to the Python level as possible. The interpreter should provide a minimum of functionality, rather than second-guessing and trying to be neat and sneaky with its operation.
How about: "because I'm the application developer, and I say that I want the raw bytes in the file."
The argument that I come up with is "don't tell me how to design my storage format, and don't make Python force me into one." If I want to write Unicode text to a file, the most natural thing to do is:

    open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to go and do some nasty conversion which just monkeys up my program. If I have a Unicode object, but I *want* to write UTF-8 to the file, then the cleanest thing is:

    open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8. I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the way of the application programmer. Give them tools, but avoid policy and gimmicks and other "magic".

Cheers, -g -- Greg Stein, http://www.lyra.org/
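Greg's preferred shape can be sketched with an in-memory "file" (the standalone `encode` function is the proposal's hypothetical API, emulated here with the string method): the Unicode object reaches the binary stream only after an explicit encode, so nothing happens behind the programmer's back.

```python
import io

def encode(u: str, name: str) -> bytes:
    # Stands in for the proposed global encode() function.
    return u.encode(name)

f = io.BytesIO()                         # plays the role of open('file', 'w')
f.write(encode("Link\u00f6ping", "utf-8"))
assert f.getvalue() == b"Link\xc3\xb6ping"   # exactly the UTF-8 bytes asked for
```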

Greg Stein wrote:
FYI, the next version of the proposal now says "s#" gives you UTF-16 and "t#" returns UTF-8. File objects opened in text mode will use "t#" and binary ones use "s#". I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Good.
I'll just use explicit u.encode('utf-8') calls if I want to write UTF-8 to binary files -- perhaps everyone else should too ;-)
You could write UTF-8 to files opened in text mode too; at least most actual systems will leave the UTF-8 escapes alone and just do LF -> CRLF translation, which should be fine. --Guido van Rossum (home page: http://www.python.org/~guido/)
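The reason Guido's observation holds: UTF-8 never uses byte 0x0A inside a multibyte sequence (lead and continuation bytes are all >= 0x80), so LF -> CRLF translation cannot corrupt the encoding. A quick check:

```python
sample = "Link\u00f6ping \u20ac\nsecond line\n"
encoded = sample.encode("utf-8")

# Every 0x0A byte in the encoding is a real newline character:
assert encoded.count(b"\n") == sample.count("\n")

# Simulated text-mode translation round-trips cleanly:
translated = encoded.replace(b"\n", b"\r\n")
assert translated.replace(b"\r\n", b"\n").decode("utf-8") == sample
```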

[MAL]
FYI, the next version of the proposal ... File objects opened in text mode will use "t#" and binary ones use "s#".
Am I the only one who sees magical distinctions between text and binary mode as a Really Bad Idea? I wouldn't have guessed the Unix natives here would quietly acquiesce to importing a bit of Windows madness <wink>.

On Wed, 17 Nov 1999, Tim Peters wrote:
It's a seductive idea... yes, it feels wrong, but then... it seems kind of right, too... :-) Yes. It is a mode. Is it bad? Not sure. You've already told the system that you want to treat the file differently. Much like you're treating it differently when you specify 'r' vs. 'w'. The real annoying thing would be to assume that opening a file as 'r' means that I *meant* text mode and to start using "t#". In actuality, I typically open files that way since I do most of my coding on Linux. If I now have to pay attention to things and open it as 'rb', then I'll be pissed. And the change in behavior and bugs that interpreting 'r' as text would introduce? Ack! Cheers, -g -- Greg Stein, http://www.lyra.org/

[MAL]
File objects opened in text mode will use "t#" and binary ones use "s#".
[Greg Stein]
Isn't that exactly what MAL said would happen? Note that a "t" flag for "text mode" is an MS extension -- C doesn't define "t", and Python doesn't either; a lone "r" has always meant text mode.
'r' is already interpreted as text mode, but so far, on Unix-like systems, there's been no difference between text and binary modes. Introducing a distinction will certainly cause problems. I don't know what the compensating advantages are thought to be.

On Wed, 17 Nov 1999, Tim Peters wrote:
Wow. "compensating advantages" ... Excellent "power phrase" there. hehe... -g -- Greg Stein, http://www.lyra.org/

Tim Peters wrote:
Em, I think you've got something wrong here: "t#" refers to the parsing marker used for writing data to files opened in text mode. Until now, all files used the "s#" parsing marker for writing data, regardless of being opened in text or binary mode. The new interpretation (new, because there previously was none ;-) of the buffer interface forces this to be changed to regain conformance.
I guess you won't notice any difference: strings define both interfaces ("s#" and "t#") to mean the same thing. Only other buffer compatible types may now fail to write to text files -- which is not so bad, because it forces the programmer to rethink what he really intended when opening the file in text mode. Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway. [Strange, I find myself arguing for a feature that I don't like myself ;-)] -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
Nope. We've got it right :-) Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to refer to the parse marker.
It *is* bad if it breaks my existing programs in subtle ways that are a bitch to track down.
Besides, if you are writing portable scripts you should pay close attention to "r" vs. "rb" anyway.
I'm not writing portable scripts. I mentioned that once before. I don't want a difference between 'r' and 'rb' on my Linux box. It was never there before, I'm lazy, and I don't want to see it added :-). Honestly, I don't know offhand of any Python types that respond to "s#" and "t#" in different ways, such that changing file.write would end up writing something different (and thereby breaking existing code). I just don't like introducing text/binary to *nix platforms where it didn't exist before. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg> I'm not writing portable scripts. I mentioned that once before. I Greg> don't want a difference between 'r' and 'rb' on my Linux box. It Greg> was never there before, I'm lazy, and I don't want to see it added Greg> :-). ... Greg> I just don't like introducing text/binary to *nix platforms where Greg> it didn't exist before. I'll vote with Greg, Guido's cross-platform conversion notwithstanding. If I haven't been writing portable scripts up to this point because I only care about a single target platform, why break my scripts for me? Forcing me to use "rb" or "wb" on my open calls isn't going to make them portable anyway. There are probably many other harder to identify and correct portability issues than binary file access anyway. Seems like requiring "b" is just going to cause gratuitous breakage with no obvious increase in portability. porta-nanny.py-anyone?-ly y'rs, Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Greg Stein wrote:
Ah, ok. But "t" as file opener is non-portable anyways, so I'll skip it here :-)
Please remember that up until now you were probably only using strings to write to files. Python strings don't differentiate between "t#" and "s#", so you won't see any change in function or find subtle errors being introduced. If you are already using the buffer feature for e.g. arrays, which also implement "s#" but don't support "t#" for obvious reasons, you'll run into trouble, but then: arrays are binary data, so changing from text mode to binary mode is well worth the effort even if you just consider it a nuisance. Since the buffer interface and its consequences haven't been published yet, there are probably very few users out there who would actually run into any problems. And even if they do, it's a good chance to catch subtle bugs which would only have shown up when trying to port to another platform. I'll leave the rest for Guido to answer, since it was his idea ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
Breaking existing code that works should be considered more than a nuisance. However, one answer would be to have "t#" _prefer_ to use the text buffer, but not insist on it. eg, the logic for processing "t#" could check if the text buffer is supported, and if not move back to the blob buffer. This should mean that all existing code still works, except for objects that support both buffers to mean different things. AFAIK there are no objects that qualify today, so it should work fine. Unix users _will_ need to revisit their thinking about "text mode" vs "binary mode" when writing these new objects (such as Unicode), but IMO that is more than reasonable - Unix users don't bother qualifying the open mode of their files, simply because it has no effect on their files. If for certain objects or requirements there _is_ a distinction, then new code can start to think these issues through. "Portable File IO" will simply be extended from simply "portable among all platforms" to "portable among all platforms and objects". Mark.

Mark Hammond wrote:
It's an error that's pretty easy to fix... that's what I was referring to with "nuisance". All you have to do is open the file in binary mode and you're done. BTW, the change will only affect platforms that don't differentiate between text and binary mode, e.g. Unix ones.
I doubt that this conforms to what the buffer interface wants to reflect: if the getcharbuf slot is not implemented, this means "I am not text". If you would write non-text to a text file, this may cause line breaks to be interpreted in ways that are incompatible with the binary data, i.e. when you read the data back in, it may fail to load because e.g. '\n' was converted to '\r\n'.
Well, even though the code would work, it might break badly someday for the above reasons. Better fix that now when there aren't too many possible cases around than at some later point where the user has to figure out the problem for himself due to the system not warning him about this.
Right. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 42 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

>> FYI, the next version of the proposal ... File objects opened in >> text mode will use "t#" and binary ones use "s#". Tim> Am I the only one who sees magical distinctions between text and Tim> binary mode as a Really Bad Idea? No. Tim> I wouldn't have guessed the Unix natives here would quietly Tim> acquiesce to importing a bit of Windows madness <wink>. We figured you and Guido would come to our rescue... ;-) Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

Don't count on me. My brain is totally cross-platform these days, and writing "rb" or "wb" for files containing binary data is second nature for me. I actually *like* it. Anyway, the Unicode stuff ought to have a wrapper open(filename, mode, encoding) where the 'b' will be added to the mode if you don't give it and it's needed. --Guido van Rossum (home page: http://www.python.org/~guido/)
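The wrapper Guido describes can be sketched as below (the name `uopen` is invented for the example; the real codecs.open that later shipped does essentially this): when an encoding is in play, 'b' is appended to the mode and a codec owns the bytes.

```python
import codecs, os, tempfile

def uopen(filename, mode="r", encoding=None):
    # open(filename, mode, encoding): add 'b' if needed, as Guido suggests.
    if encoding is None:
        return open(filename, mode)
    if "b" not in mode:
        mode += "b"            # force binary; the codec handles the bytes
    return codecs.open(filename, mode, encoding=encoding)

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with uopen(path, "w", "utf-16") as f:      # 'w' silently becomes 'wb'
    f.write("Link\u00f6ping")
with uopen(path, "r", "utf-16") as f:
    assert f.read() == "Link\u00f6ping"
```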

Hrm. Can you quote examples of users of t# who would be confused by multibyte characters? I guess that there are quite a few places where they will be considered illegal, but that's okay -- the string will be parsed at some point and rejected, e.g. as an illegal filename, hostname or whatever. On the other hand, there are quite a few places where I would think that multibyte characters would do just the right thing. Many places using t# could just as well be using 's' except they need to know the length and they don't want to call strlen(). In all cases I've looked at, the reason they need the length is that they are allocating a buffer (or checking whether it fits in a statically allocated buffer) -- and there the number of bytes in a multibyte string is just fine. Note that I take the same stance on 's' -- it should return multibyte characters.
Here I'm with you, man!
Greg Stein, http://www.lyra.org/
--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Stein writes:
[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]
And the sooner I receive them, the sooner they can be integrated! Any plans to get them to me? I'll probably want to do another release before the IPC8. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

M.-A. Lemburg writes:
Perhaps I missed the agreement that these should always receive UTF-8 from Unicode strings. Was this agreed upon, or has it simply not been argued over in favor of other topics? If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization! Perhaps there should be two pointers: one to the UTF-8 buffer and one to a PyObject; if the PyObject is there it's an "old-style" string that's actually providing the buffer. This may or may not be a good idea; there's a lot of memory expense for long Unicode strings converted from UTF-8 that aren't ever converted back to UTF-8 or accessed using "s" or "s#". Ok, I've talked myself out of that. ;-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

Fred L. Drake, Jr. <fdrake@acm.org> wrote:
    from unicode import *

    def getname():
        # hidden in some database engine, or so...
        return unicode("Linköping", "iso-8859-1")

    ...

    name = getname()

    # emulate automatic conversion to utf-8
    name = str(name)

    # print it in uppercase, in the usual way
    import string
    print string.upper(name)

    ## LINKöPING

I don't know, but I think that I think that it perhaps should raise an exception instead... </F>

"Fred L. Drake, Jr." wrote:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode aware.
If this has indeed been agreed upon... at least it can be computed on demand rather than at initialization!
This is what I intended to implement. The <defencbuf> buffer will be filled upon the first request to the UTF-8 encoding. "s" and "s#" are examples of such requests. The buffer will remain intact until the object is destroyed (since other code could store the pointer received via e.g. "s").
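A Python-level sketch of the <defencbuf> behavior MAL describes (the class and method names are invented for illustration): the default-encoding buffer is built on first request and then kept for the lifetime of the object, so pointers handed out via "s" stay valid.

```python
class UniStr:
    # Toy model of a Unicode object with a lazily-filled UTF-8 buffer.
    def __init__(self, value: str):
        self._value = value
        self._defenc = None        # the cached default-encoding buffer

    def defenc(self) -> bytes:
        if self._defenc is None:   # filled only upon the first request
            self._defenc = self._value.encode("utf-8")
        return self._defenc

u = UniStr("Link\u00f6ping")
assert u._defenc is None           # nothing computed at construction
first = u.defenc()
assert first == b"Link\xc3\xb6ping"
assert u.defenc() is first         # same buffer until the object dies
```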
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
It's been in the proposal since version 0.1. The idea is to provide a decent way of making existing scripts Unicode aware.
Ok, so I haven't read closely enough.
Right.
Note that Unicode objects are a completely different beast ;-) String objects are not touched in any way by the proposal.
I wasn't suggesting the PyStringObject be changed, only that the PyUnicodeObject could maintain a reference. Consider:

    s = fp.read()
    u = unicode(s, 'utf-8')

u would now hold a reference to s, and s/s# would return a pointer into s instead of re-building the UTF-8 form. I talked myself out of this because it would be too easy to keep a lot more string objects around than were actually needed. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
Agreed. Also, the encoding would always be correct. <defencbuf> will always hold the <default encoding> version (which should be UTF-8...). -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Tim Peters writes:
Yet another use for a weak reference <0.5 wink>.
Those just keep popping up! I seem to recall Diane Hackborne actually implemented these under the name "vref" long ago; perhaps that's worth revisiting after all? (Not the implementation so much as the idea.) I think to make it general would cost one PyObject* in each object's structure, and some code in some constructors (maybe), and all destructors, but not much. Is this worth pursuing, or is it locked out of the core because of the added space for the PyObject*? (Note that the concept isn't necessarily useful for all object types -- numbers in particular -- but it only makes sense to bother if it works for everything, even if it's not very useful in some cases.) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

"Fred L. Drake, Jr." wrote:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 45 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg writes:
FYI, there's mxProxy which implements a flavor of them. Look in the standard places for mx stuff ;-)
Yes, but still not in the core. So we have two general examples (vrefs and mxProxy) and there's WeakDict (or something like that). I think there really needs to be a core facility for this. There are a lot of users (including myself) who think that things are far less useful if they're not in the core. (No, I'm not saying that everything should be in the core, or even that it needs a lot more stuff. I just don't want to be writing code that requires a lot of separate packages to be installed. At least not until we can tell an installation tool to "install this and everything it depends on." ;) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us of his work; & back to Fred]
This kind of thing certainly belongs in the core (for efficiency and smooth integration) -- if it belongs in the language at all. This was discussed at length here some months ago; that's what prompted MAL to "do something" about it. Guido hasn't shown visible interest, and nobody has been willing to fight him to the death over it. So it languishes. Buy him lunch tomorrow and get him excited <wink>.

Tim Peters writes:
Guido has asked me to pursue this topic, so I'll be checking out available implementations and seeing if any are adoptable or if something different is needed to be fully general and well-integrated. -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

[Fred L. Drake, Jr.]
Just don't let "fully general" stop anything for its sake alone; e.g., if there's a slick trick that *could* exempt numbers, that's all to the good! Adding a pointer to every object is really unattractive, while adding a flag or two to type objects is dirt cheap. Note in passing that current Java addresses weak refs too (several flavors of 'em! -- very elaborate).

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
Bull! You can easily provide "s#" support by returning the pointer to the Unicode buffer. The *entire* reason for introducing "t#" is to differentiate between returning a pointer to an 8-bit [character] buffer and a not-8-bit buffer. In other words, the work done to introduce "t#" was done *SPECIFICALLY* to allow "s#" to return a pointer to the Unicode data. I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-) Cheers, -g -- Greg Stein, http://www.lyra.org/

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies to deal with :-)
I haven't made up my mind yet (due to a very successful Python-promoting visit to SD'99 east, I'm about 100 msgs behind in this thread alone) but let me warn you that I can deal with the carnage, if necessary. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, 13 Nov 1999, Guido van Rossum wrote:
Bring it on, big boy! :-) -- Greg Stein, http://www.lyra.org/

On Fri, 12 Nov 1999, Tim Peters wrote:
No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...
Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter to deal with a new string type. I'd guess it is a helluva lot easier for us to add a Python Type than for Perl or TCL to whack around with new string types (since they use strings so heavily). Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
I know, but this is a little different: you use strings a lot while import hooks are rarely used directly by the user. E.g. people in Europe will probably prefer Latin-1 as default encoding while people in Asia will use one of the common CJK encodings. The <default encoding> decides what encoding to use for many typical tasks: printing, str(u), "s" argument parsing, etc. Note that setting the <default encoding> is not intended to be done prior to single operations. It is meant to be settable at thread creation time.
The reason for UTF-16 is simply that it is identical to UCS-2 over large ranges, which makes optimizations (e.g. the UCS2 flag I mentioned in an earlier post) feasible and effective. UTF-8 slows things down for CJK encodings, since the APIs will very often have to scan the string to find the correct logical position in the data. Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):

"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text using UCS-4 indices. However, while converting from a UCS-4 index to a UTF-16 index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as UCS-4 characters results in a 10X degradation. Of course, the precise differences will depend on the compiler, and there are some interesting optimizations that can be performed, but it will always be slower on average. This kind of performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing are at the common storage level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the storage units. This provides efficiency at the low levels, and the required functionality at the high levels.

Convenience APIs can be produced that take parameters in UCS-4 methods for common utilities: e.g. converting UCS-4 indices back and forth, accessing character properties, etc.

Outside of indexing, differences between UCS-4 and UTF-16 are not as important. For most other APIs outside of indexing, characters values cannot really be considered outside of their context--not when you are writing internationalized code. For such operations as display, input, collation, editing, and even upper and lowercasing, characters need to be considered in the context of a string.
That means that in any event you end up looking at more than one character. In our experience, the incremental cost of doing surrogates is pretty small. """
All those formats are upward compatible (within certain ranges) and the Python Unicode API will provide converters between its internal format and the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
See above.
"""
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for UTF-16, meaning that in practice the difference between UCS-2 and UTF-16 is probably negligible, e.g. we could define the internal format to be UTF-16 and raise an exception whenever the border between UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-). But... I think HP has the last word on this one. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
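The UCS-2/UTF-16 arithmetic behind MAL's quote can be checked in modern Python (hedged: surrogate pairs were not yet in use when this was written, but this is how UTF-16 ended up working): a BMP character is one 16-bit code unit, a character beyond U+FFFF needs a surrogate pair.

```python
# EURO SIGN (U+20AC) is in the BMP: one UTF-16 code unit, 2 bytes.
assert len("\u20ac".encode("utf-16-le")) == 2

# MUSICAL SYMBOL G CLEF (U+1D11E) is beyond the BMP: a surrogate
# pair, 4 bytes.
assert len("\U0001d11e".encode("utf-16-le")) == 4

# The pair encodes the scalar value per the UTF-16 rule:
hi, lo = divmod(0x1D11E - 0x10000, 0x400)
assert (0xD800 + hi, 0xDC00 + lo) == (0xD834, 0xDD1E)
```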

I can't make time for a close review now. Just one thing that hit my eye early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by hand -- it breaks apart and rearranges bytes at the bit level, and everything other than 7-bit ASCII requires solid strings of "high-bit" characters. This is painful for people to enter manually on both counts -- and no common reference gives the UTF-8 encoding of glyphs directly. So, as discussed earlier, we should follow Java's lead and also introduce a \u escape sequence:

    octet:          hexdigit hexdigit
    unicodecode:    octet octet
    unicode_escape: "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the Unicode character at the unicodecode code position. For consistency, then, it should probably expand the same way inside "regular strings" too. Unlike Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit: The vast bulk of UTF-8 encodings encode characters in UCS-4 space outside of Unicode. In good Pythonic fashion, those must either be explicitly outlawed, or explicitly defined. I vote for outlawed, in the sense of detected error that raises an exception. That leaves our future options open.

BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.

international-in-spite-of-himself-ly y'rs - tim
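For what it's worth, Tim's ord() questions got answered the way he hoped (hedged: this shows the behavior that eventually settled in, not anything decided at the time of this message): ord() of a one-character Unicode string gives the code point, and its inverse exists.

```python
# ord() maps a one-character string to its Unicode ordinal...
assert ord("\u20ac") == 0x20AC

# ...and chr() (unichr() in Python 2) is its inverse:
assert chr(0x20AC) == "\u20ac"

# The Java-style \u escape denotes exactly one character:
assert "\u0061" == "a"
```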

Tim Peters wrote:
unless you're using a UTF-8 aware editor, of course ;-) (some days, I think we need some way to tell the compiler what encoding we're using for the source file...)
good idea. and for some reason, patches for this are included in the unicode distribution (see the attached str2utf.c).
I vote for 'outlaw'. </F>

/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

   \uxxxx -- Unicode character with hexadecimal value xxxx.  The
   character is stored using UTF-8 encoding, which means that this
   sequence can result in up to three encoded characters.

   Note that the 'u' must be followed by four hexadecimal digits.  If
   fewer digits are given, the sequence is left in the resulting string
   exactly as given.  If more digits are given, only the first four are
   translated to Unicode, and the remaining digits are left in the
   resulting string. */

#define Py_CHARMASK(ch) ch

void convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:
        bogus:
            *p++ = '\\';
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

main()
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", buffer);

    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);
}
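The bit-twiddling in /F's snippet can be sanity-checked against Python's own UTF-8 codec. The following is a modern sketch of the same packing rules (BMP ordinals only, as in the C code above), not part of the original patch:

```python
# The same UTF-8 packing rules as the C snippet above (up to U+FFFF),
# checked against Python's built-in UTF-8 codec.
def utf8_bytes(ch):
    if ch < 0x80:                     # 1-byte form: 0xxxxxxx
        return bytes([ch])
    if ch < 0x800:                    # 2-byte form: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (ch >> 6), 0x80 | (ch & 0x3F)])
    # 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (ch >> 12),
                  0x80 | ((ch >> 6) & 0x3F),
                  0x80 | (ch & 0x3F)])

assert utf8_bytes(0x00F6) == '\u00f6'.encode('utf-8')  # the 'ö' in "Link\u00f6ping"
assert utf8_bytes(0x1234) == '\u1234'.encode('utf-8')
```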

[/F, dripping with code]
Yuck -- don't let probable error pass without comment. "must be" == "must be"! [moving backwards]
The code is fine, but I've gotten confused about what the intent is now. Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8 literals, but now he's got Unicode-escaped literals instead -- and you favor an internal 2-byte-per-char Unicode storage format. In that combination of worlds, is there any use in the *language* (as opposed to in a runtime module) for \uxxxx -> UTF-8 conversion? And MAL, if you're listening, I'm not clear on what a Unicode-escaped literal means. When you had UTF-8 literals, the meaning of something like u"a\340\341" was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals were just a way of specifying a byte stream. As a Unicode-escaped string, I assume the "a" maps to the Unicode "a", but what of the rest? Are the octal escapes to be taken as two separate Latin-1 characters (in their role as a Unicode subset), or as an especially clumsy way to specify a single 16-bit Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x escapes. One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"? There probably should be; and while Guido will hate this, a ur string should probably *not* leave \uxxxx escapes untouched. Nasties like this are why Java defines \uxxxx expansion as occurring in a preprocessing step. BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).

Tim Peters wrote:
I second that.
No, no... :-) I think it was a simple misunderstanding... \uXXXX is only to be used within u'' strings and then gets expanded to *one* character encoded in the internal Python format (which is heading towards UTF-16 without surrogates).
Good points. The conversion goes as follows:

· for single characters (and this includes all \XXX sequences except \uXXXX), take the ordinal and interpret it as Unicode ordinal

· for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX instead
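For what it's worth, these two rules are exactly what modern Python ended up doing, so they can be checked directly today (a hedged aside, not the 1999 implementation):

```python
# Rule 1: a single-character escape such as octal \341 gives the Unicode
# character with that ordinal (0xE1 -> U+00E1).
assert '\341' == '\u00e1'

# Rule 2: \uXXXX inserts exactly one character with ordinal 0xXXXX.
assert len('\u1234') == 1 and ord('\u1234') == 0x1234
```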
Not sure whether we really need to make this even more complicated... The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames won't hurt much in the context of those \uXXXX monsters :-)
BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).
Right. \uXXXX will only be allowed in u'' strings, not in "normal" strings. BTW, if you want to type in UTF-8 strings and have them converted to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 50 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[MAL]
Perfect! [about "raw" Unicode strings]
Alas, this won't stand over the long term. Eventually people will write Python using nothing but Unicode strings -- "regular strings" will eventually become a backward compatibility headache <0.7 wink>. IOW, Unicode regexps and Unicode docstrings and Unicode formatting ops ... nothing will escape. Nor should it. I don't think it all needs to be done at once, though -- existing languages usually take years to graft in gimmicks to cover all the fine points. So, happy to let raw Unicode strings pass for now, as a relatively minor point, but without agreeing it can be ignored forever.
That's what I figured, and thanks for the confirmation.

Tim Peters wrote:
Thanks :-)
Agreed... note that you could also write your own codec for just this reason and then use:

u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on raw Unicode strings]
Well, not quite. In general you need to pass raw strings:

    u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
                ^
    u = ur(r'...\u4545...\...\...')
           ^

else Python will replace all the other backslash sequences. This is a crucial distinction at times; e.g., else \b in a Unicode regexp will expand into a backspace character before the regexp processor ever sees it (\b is supposed to be a word boundary assertion).
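Tim's \b pitfall is easy to demonstrate. This sketch uses the modern re module, which behaves the same way as the regexp engines under discussion here:

```python
import re

# In a non-raw literal, \b is a backspace character (ordinal 8) before
# the regexp engine ever sees it; only the raw form reaches re as \b.
assert len('\b') == 1 and ord('\b') == 8
assert len(r'\b') == 2

# The word-boundary assertion therefore needs the raw spelling:
assert re.search(r'\bword\b', 'a word here') is not None
assert re.search('\bword\b', 'a word here') is None   # backspaces never match
```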

Tim Peters <tim_one@email.msn.com> wrote:
(\b is supposed to be a word boundary assertion).
in some places, that is. </F>

Main Entry: reg·u·lar
Pronunciation: 're-gy&-l&r, 're-g(&-)l&r
1 : belonging to a religious order
2 a : formed, built, arranged, or ordered according to some established rule, law, principle, or type ...
3 a : ORDERLY, METHODICAL <regular habits> ...
4 a : constituted, conducted, or done in conformity with established or prescribed usages, rules, or discipline ...

Tim Peters wrote:
Right. Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):
    l = map(None,s)
    for i in range(len(l)):
        l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):
    l = []
    start = 0
    while start < len(s):
        m = u_escape.search(s,start)
        if not m:
            l[len(l):] = convert_string(s[start:])
            break
        m_start,m_end = m.span()
        if m_start > start:
            l[len(l):] = convert_string(s[start:m_start])
        hexcode = m.group(1)
        #print hexcode,start,m_start
        if len(hexcode) != 4:
            raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
        ordinal = string.atoi(hexcode,16)
        l.append(struct.pack(pack_format,ordinal))
        start = m_end
    #print l
    return string.join(l,'')

def hexstr(s,sep=''):
    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[MAL]
It looks like r'\\u0000' will get translated into a 2-character Unicode string. That's probably not good, if for no other reason than that Java would not do this (it would create the obvious 7-character Unicode string), and having something that looks like a Java escape that doesn't *work* like the Java escape will be confusing as heck for JPython users. Keeping track of even-vs-odd number of backslashes can't be done with a regexp search, but is easy if the code is simple <wink>:

def unicode_unescape(s):
    from string import atoi
    import array
    i, n = 0, len(s)
    result = array.array('H')  # unsigned short, native order
    while i < n:
        ch = s[i]
        i = i+1
        if ch != "\\":
            result.append(ord(ch))
            continue
        if i == n:
            raise ValueError("string ends with lone backslash")
        ch = s[i]
        i = i+1
        if ch != "u":
            result.append(ord("\\"))
            result.append(ord(ch))
            continue
        hexchars = s[i:i+4]
        if len(hexchars) != 4:
            raise ValueError("\\u escape at end not followed by "
                             "at least 4 characters")
        i = i+4
        for ch in hexchars:
            if ch not in "01234567890abcdefABCDEF":
                raise ValueError("\\u" + hexchars + " contains "
                                 "non-hex characters")
        result.append(atoi(hexchars, 16))
    # print result
    return result.tostring()

Tim Peters wrote:
Right...
Guido and I have decided to turn \uXXXX into a standard escape sequence with no further magic applied. \uXXXX will only be expanded in u"" strings. Here's the new scheme:

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as a Unicode ordinal (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as Unicode ordinals; note that \xXXXX can represent all Unicode ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

Now how should we define ur"abc\u1234\n" ... ?

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 44 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
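The 'unicode-escape' encoding sketched here survives in modern Python under the same name, so the examples can be checked directly (a hedged aside using the descendant codec, not the 1999 draft; the final ordinal comes out as U+000A, per Tim's correction downthread):

```python
# u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
s = b'abc\\u1234\\n'.decode('unicode_escape')
assert [hex(ord(c)) for c in s] == ['0x61', '0x62', '0x63', '0x1234', '0xa']
```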

[MAL]
Does that exclude ur"" strings? Not arguing either way, just don't know what all this means.
Same as before (scream if that's wrong).
· all existing defined Python escape sequences are interpreted as Unicode ordinals;
Same as before (ditto).
note that \xXXXX can represent all Unicode ordinals,
This means that the definition of \xXXXX has changed, then -- as you pointed out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new \x definition apply only in u"" strings, or in "" strings too? What is the new \x definition?
and \OOO (octal) can represent Unicode ordinals up to U+01FF.
Same as before (ditto).
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax error to have fewer than 4 digits after \u.
Same as before (ditto). IOW, I don't see anything that's changed other than an unspecified new treatment of \x escapes, and possibly that ur"" strings don't expand \u escapes.
The last example is damaged (U+05c isn't legit). Other than that, these look the same as before.
Now how should we define ur"abc\u1234\n" ... ?
If strings carried an encoding tag with them, the obvious answer is that this acts exactly like r"abc\u1234\n" acts today except gets a "unicode-escaped" encoding tag instead of a "[whatever the default is today]" encoding tag. If strings don't carry an encoding tag with them, you're in a bit of a pickle: you'll have to convert it to a regular string or a Unicode string, but in either case have no way to communicate that it may need further processing; i.e., no way to distinguish it from a regular or Unicode string produced by any other mechanism. The code I posted yesterday remains my best answer to that unpleasant puzzle (i.e., produce a Unicode string, fiddling with backslashes just enough to get the \u escapes expanded, in the same way Java's (conceptual) preprocessor does it).

Tim Peters wrote:
Guido decided to make \xYYXX return U+YYXX *only* within u"" strings. In "" (Python strings) the same sequence will result in chr(0xXX).
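For the record, this corner of the design later changed again: modern Python settled on \x taking exactly two hex digits everywhere, with any further characters left as literal text (a hedged modern footnote, not what was decided in this thread):

```python
# \xAB consumes exactly two hex digits; 'CDq' stays literal text.
assert '\xABCDq' == '\xab' + 'CDq'

# In str literals \xAB names a Unicode ordinal, same as \u00AB.
assert '\xe9' == '\u00e9' and ord('\xe9') == 0xE9
```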
The difference is that we no longer take the two step approach. \uXXXX is treated at the same time all other escape sequences are decoded (the previous version first scanned and decoded all standard Python sequences and then turned to the \uXXXX sequences in a second scan).
Corrected; thanks.
They don't have such tags... so I guess we're in trouble ;-) I guess to make ur"" have a meaning at all, we'd need to go the Java preprocessor way here, i.e. scan the string *only* for \uXXXX sequences, decode these and convert the rest as-is to Unicode ordinals. Would that be ok ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Read Tim's code (posted about 40 messages ago in this list). Like Java, it interprets \u.... when the number of backslashes is odd, but not when it's even. So \\u.... returns exactly that, while \\\u.... returns two backslashes and a unicode character. This is nice and can be done regardless of whether we are going to interpret other \ escapes or not. --Guido van Rossum (home page: http://www.python.org/~guido/)
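Guido's even/odd rule survives in the 'raw_unicode_escape' codec that modern Python still ships, so it can be observed directly (a hedged check against the descendant codec, not Tim's original function):

```python
# Odd number of leading backslashes: \u0041 is expanded to 'A'.
assert b'\\u0041'.decode('raw_unicode_escape') == 'A'

# Even number: the backslashes and the 'u0041' come through literally.
assert b'\\\\u0041'.decode('raw_unicode_escape') == '\\\\u0041'
```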

Guido van Rossum wrote:
I did, but wasn't sure whether he was arguing for going the Java way...
So I'll take that as: this is what we want in Python too :-) -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 43 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I'll reserve judgement until we've got some experience with it in the field, but it seems the best compromise. It also gives a clear explanation about why we have \uXXXX when we already have \xXXXX. --Guido van Rossum (home page: http://www.python.org/~guido/)

Would this definition be fine ?

"""
u = ur'<raw-unicode-escape encoded Python string>'

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinals (e.g. 'b' -> U+0062)
"""

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Yes. --Guido van Rossum (home page: http://www.python.org/~guido/)

Tim Peters wrote:
It would be more consistent to use the Unicode ordinal (instead of interpreting the number as a UTF-8 encoding), e.g. \u03C0 for Pi. The codes are easy to look up in the standard's UnicodeData.txt file, or the Unicode book for that matter.
See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2. Perhaps we could add a flag to Unicode objects stating whether the characters can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are the same in most ranges). This flag could then be used to choose optimized algorithms for scanning the strings. Fredrik's implementation currently uses UCS2, BTW.
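The UCS2-vs-UTF16 distinction MAL alludes to is exactly the surrogate mechanism: UTF-16 can reach ordinals above U+FFFF, plain UCS-2 cannot. A small sketch of the standard arithmetic (textbook UTF-16 math, not code from this thread):

```python
# UTF-16 encodes ordinals above U+FFFF as a high/low surrogate pair.
def surrogate_pair(ch):
    assert 0x10000 <= ch <= 0x10FFFF
    ch -= 0x10000                      # 20 bits remain
    return 0xD800 + (ch >> 10), 0xDC00 + (ch & 0x3FF)

# U+10400 (DESERET CAPITAL LETTER LONG I) -> D801 DC00
assert surrogate_pair(0x10400) == (0xD801, 0xDC00)
```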
BTW, is ord(unicode_char) defined? And as what? And does ord have an inverse in the Unicode world? Both seem essential.
Good points. How about

    uniord(u[:1]) --> Unicode ordinal number (32-bit)
    unichr(i)     --> Unicode object for character i (provided it is 32-bit);
                      ValueError otherwise

They are inverse of each other, but note that Unicode allows private encodings too, which will of course not necessarily make it across platforms or even from one PC to the next (see Andy Robinson's interesting case study).

I've uploaded a new version of the proposal (0.3) to the URL: http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
Why new functions? Why not extend the definition of ord() and chr()? In terms of backwards compatibility, the only issue could possibly be that people relied on chr(x) to throw an error when x>=256. They certainly couldn't pass a Unicode object to ord(), so that function can safely be extended to accept a Unicode object and return a larger integer. Cheers, -g -- Greg Stein, http://www.lyra.org/

Greg Stein wrote:
Because unichr() will always have to return Unicode objects. You don't want chr(i) to return Unicode for i>255 and strings for i<256. OTOH, ord() could probably be extended to also work on Unicode objects. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL, on Unicode chr() and ord()]
Indeed I do not!
OTOH, ord() could probably be extended to also work on Unicode objects.
I think it should be -- it's a good & natural use of polymorphism; introducing a new function *here* would be as odd as introducing a unilen() function to get the length of a Unicode string.
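This is in fact how it turned out: Python 3 extended chr() and ord() to the full Unicode range and dropped unichr() entirely (a modern check, offered as a footnote to the thread):

```python
# ord() and chr() are inverses over the whole Unicode range.
assert ord('\u20ac') == 0x20AC and chr(0x20AC) == '\u20ac'   # EURO SIGN
assert chr(ord('\U0001F40D')) == '\U0001F40D'   # works above U+FFFF too
```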

Tim Peters wrote:
Fine. So I'll drop the uniord() API and extend ord() instead. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 50 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: The internal format for Unicode objects should either use a Python specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte little endian byte order) or a compiler provided wchar_t format (if available). Using the wchar_t format will ease embedding of Python in other Unicode aware applications, but will also make internal format dumps platform dependent. having been there and done that, I strongly suggest a third option: a 16-bit unsigned integer, in platform specific byte order (PY_UNICODE_T). along all other roads lie code bloat and speed penalties... (besides, this is exactly how it's already done in unicode.c and what 'sre' prefers...) </F>

Fredrik Lundh wrote:
Ok, byte order can cause a speed penalty, so it might be worthwhile introducing sys.bom (or sys.endianness) for this reason and sticking to 16-bit integers as you have already done in unicode.h. What I don't like is using wchar_t if available (and then addressing it as if it were defined as unsigned integer). IMO, it's better to define a Python Unicode representation which then gets converted to whatever wchar_t represents on the target machine. Another issue is whether to use UCS2 (as you have done) or UTF16 (which is what Unicode 3.0 requires)... see my other post for a discussion. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

you should read the unicode.h file a bit more carefully:

...
/* Unicode declarations.  Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
   wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

(this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
   If a short is not 16 bits on your platform, you have to fix the
   typedef below, or the module initialization code will complain. */

(this maps iswspace to isspace, for 8-bit characters).

#endif
...

the plan was to use the second solution (using "configure" to figure out what integer type to use), and its own unicode database table for the is/to primitives (iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive). </F>

Fredrik Lundh wrote:
Oh, I did read unicode.h, stumbled across the mixed usage and decided not to like it ;-) Seriously, I find the second solution where you use the 'unsigned short' much more portable and straight forward. You never know what the compiler does for isw*() and it's probably better sticking to one format for all platforms. Only endianness gets in the way, but that's easy to handle. So I opt for 'unsigned short'. The encoding used in these 2 bytes is a different question though. If HP insists on Unicode 3.0, there's probably no other way than to use UTF-16.
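MAL's claim that "only endianness gets in the way, but that's easy to handle" can be sketched with the struct module (a hedged illustration of the 16-bit code-unit idea, not the actual implementation under discussion):

```python
import struct, sys

# A 16-bit code unit for U+20AC in the two explicit byte orders...
assert struct.pack('>H', 0x20AC) == b'\x20\xac'   # big-endian
assert struct.pack('<H', 0x20AC) == b'\xac\x20'   # little-endian

# ...and in native order, which is what an internal buffer would use;
# only at the I/O boundary does the difference need handling.
native = struct.pack('=H', 0x20AC)
assert native == (b'\x20\xac' if sys.byteorder == 'big' else b'\xac\x20')
```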
(iirc, the unicode.txt file discussed this, but that one seems to be missing from the zip archive).
It's not in the file I downloaded from your site. Could you post it here ? -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Fredrik Lundh writes:
I actually like this best, but I understand that there are reasons for using wchar_t, especially for interfacing with other code that uses Unicode. Perhaps someone who knows more about the specific issues with interfacing using wchar_t can summarize them, or point me to whatever I've already missed. p-) -Fred -- Fred L. Drake, Jr. <fdrake@acm.org> Corporation for National Research Initiatives

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
I agree 100% !! wchar_t will introduce portability issues right on up into the Python level. The byte-order introduces speed issues and OS interoperability issues, yet solves no portability problems (Byte Order Marks should still be present and used). There are two "platforms" out there that use Unicode: Win32 and Java. They both use UCS-2, AFAIK. Cheers, -g -- Greg Stein, http://www.lyra.org/
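The Byte Order Marks Greg mentions are just U+FEFF serialized in each byte order; modern Python exposes them as constants in the codecs module (a hedged modern aside, not part of the 1999 design):

```python
import codecs

# U+FEFF in each byte order; a reader uses these to detect endianness.
assert codecs.BOM_UTF16_BE == b'\xfe\xff'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'

# The generic utf-16 codec writes the native-order BOM automatically.
assert 'x'.encode('utf-16').startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```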

Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
Marc-Andre writes: Unicode objects should have a pointer to a cached (read-only) char buffer <defencbuf> holding the object's value using the current <default encoding>. This is needed for performance and internal parsing (see below) reasons. The buffer is filled when the first conversion request to the <default encoding> is issued on the object. keeping track of an external encoding is better left for the application programmers -- I'm pretty sure that different application builders will want to handle this in radically different ways, depending on their environ- ment, underlying user interface toolkit, etc. besides, this is how Tcl would have done it. Python's not Tcl, and I think you need *very* good arguments for moving in that direction. </F>

Fredrik Lundh wrote:
It's not that hard to implement. All you have to do is check whether the current encoding in <defencbuf> still is the same as the threads view of <default encoding>. The <defencbuf> buffer is needed to implement "s" et al. argument parsing anyways.
-- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 51 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Just a couple observations from the peanut gallery... 1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff. Seems like it would be easier to just get the whole world speaking Esperanto. 2. Are there plans for an internationalization session at IPC8? Perhaps a few key players could be locked into a room for a couple days, to emerge bloodied, but with an implementation in-hand... Skip Montanaro | http://www.mojam.com/ skip@mojam.com | http://www.musi-cal.com/ 847-971-7098 | Python: Programming the way Guido indented...

"SM" == Skip Montanaro <skip@mojam.com> writes:
SM> 2. Are there plans for an internationalization session at SM> IPC8? Perhaps a few key players could be locked into a room SM> for a couple days, to emerge bloodied, but with an SM> implementation in-hand... I'm starting to think about devday topics. Sounds like an I18n session would be very useful. Champions? -Barry
participants (15)

- Andrew M. Kuchling
- Andy Robinson
- Barry A. Warsaw
- David Ascher
- Fred L. Drake, Jr.
- Fredrik Lundh
- Gordon McMillan
- Greg Stein
- Guido van Rossum
- Jean-Claude Wippler
- Ka-Ping Yee
- M.-A. Lemburg
- Mark Hammond
- Skip Montanaro
- Tim Peters