Mailman 3 November 1999 - Python-Dev

Re: [Python-Dev] just say no...
by Jack Jansen Nov. 13, 1999

Nov. 13, 1999

The problem with "s" and "s#" is that they're already semantically overloaded, and will become more so with support for multiple charsets. Some modules use "s#" when they mean "give me a pointer to an area of memory and its length". Writing to binary files is an example of this. Some modules use it to mean "give me a pointer to a string". Writing to a text file is (probably) an example of this. Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This is the case if … [View More]

3 4

RE: [Python-Dev] Internationalization Toolkit
by Andy Robinson Nov. 12, 1999

Nov. 12, 1999

--- "Da Silva, Mike" <Mike.Da.Silva(a)uk.fid-intl.com> wrote: > As I see it, the relative pros and cons of UTF-8 > versus UTF-16 for use as an > internal string representation are: > [snip] > Regards, > Mike da Silva > Note that by going with UTF16, we get both. We will certainly have a codec for utf8, just as we will for ISO-Latin-1, Shift-JIS or whatever. And a perfectly ordinary Python string is a great place to hold UTF8; you can look at it and use most of the … [View More]

1 0

RE: [Python-Dev] Internationalization Toolkit
by Andy Robinson Nov. 12, 1999

Nov. 12, 1999

--- Gordon McMillan <gmcm(a)hypernet.com> wrote: > [per-thread defaults] > > C'mon guys, hasn't anyone ever played consultant > before? The > idea is obviously brain-dead. OTOH, they asked for > it > specifically, meaning they have some assumptions > about how > they think they're going to use it. If you give them > what they > ask for, you'll only have to fix it when they > realize there are > other ways of doing things that don't work with &… [View More]

2 1

Internationalization Toolkit
by Andy Robinson Nov. 12, 1999

Nov. 12, 1999

In general, I like this proposal a lot, but I think it only covers half the story. How we actually build the encoder/decoder for each encoding is a very big issue. Thoughts on this below. First, a little nit > u = u'<utf-8 encoded Python string>' I don't like using funny prime characters - why not an explicit function like "utf8()" On to the important stuff:> > unicodec.register(<encname>,<encoder>,<decoder> > [,<stream_encoder>, <… [View More]stream_decoder>]) > This registers the codecs under the given encoding > name in the module global dictionary > unicodec.codecs. Stream codecs are optional: > the unicodec module will provide appropriate > wrappers around <encoder> and > <decoder> if not given. I would MUCH prefer a single 'Encoding' class or type to wrap up these things, rather than up to four disconnected objects/functions. Essentially it would be an interface standard and would offer methods to do the four things above. There are several reasons for this. (1) there are quite a lot of things you might want to do with an encoding object, and we could extend the interface in future easily. As a minimum, give it the four methods implied by the above, two of which can be defaults. But I'd like an encoding to be able to tell me the set of characters to which it applies; validate a string; and maybe tell me if it is a subset or superset of another. (2) especially with double-byte encodings, they will need to load up some kind of database on startup and use this for both encoding and decoding - much better to share it and encapsulate it inside one object (3) for some languages, there are extra functions wanted. For Japanese, you need two or three functions to expand half-width to full-width katakana, convert double-byte english to single-byte and vice versa. A Japanese encoding object would be a handy place to put this knowledge. (4) In the real world you get many encodings which are subtle variations of the same thing, plus or minus a few characters. One bit of code might be able to share the work of several encodings, by setting a few flags. Certainly true of Japanese. (5) encoding/decoding algorithms can be program or data or (very often) a bit of both. We have not yet discussed where to keep all the mapping tables, but if data is involved it should be hidden in an object. (6) See my comments on a state machine for doing the encodings. If this is done well, we might two different standard objects which conform to the Encoding interface (a really light one for single-byte encodings, and a bigger one for multi-byte), and everything else could be data driven. (6) Easy to grow - encodings can be prototyped and proven in Python, ported to C if needed or when ready. In summary, firm up the concept of an Encoding object and give it room to grow - that's the key to real-world usefulness. If people feel the same way I'll have a go at an interface for that, and try show how it would have simplified specific problems I have faced. We also need to think about where encoding info will live. You cannot avoid mapping tables, although you can hide them inside code modules or pickled objects if you want. Should there be a standard "..\Python\Enc" directory? And we're going to need some kind of testing and certification procedure when adding new encodings. This stuff has to be right. Guido asked about TypedString. This can probably be done on top of the built-in stuff - it is just a convenience which would clarify intent, reduce lines of code and prevent people shooting themselves in the foot when juggling a lot of strings in different (non-Unicode) encodings. I can do a Python module to implement that on top of whatever is built. Regards, Andy ===== Andy Robinson Robinson Analytics Ltd. ------------------ My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day. __________________________________________________ Do You Yahoo!? Bid and sell for free at http://auctions.yahoo.com [View Less]

5 6

Re: [Python-Dev] Internationalization Toolkit
by Andy Robinson Nov. 11, 1999

Nov. 11, 1999

> See my other post on the subject... > > Note that if we make UTF-8 the standard encoding, > nearly all > special Latin-1 characters will produce UTF-8 errors > on input > and unreadable garbage on output. That will probably > be unacceptable > in Europe. To remedy this, one would *always* have > to use > u.encode('latin-1') to get readable output for > Latin-1 strings > repesented in Unicode. You beat me to it - a colleague and I were just discussing … [View More]

2 1

Re: [Python-Dev] Internationalization Toolkit
by Jack Jansen Nov. 11, 1999

Nov. 11, 1999

> > [MAL, on Unicode chr() and ord() > > > ... > > > Because unichr() will always have to return Unicode objects. You don't > > > want chr(i) to return Unicode for i>255 and strings for i<256. > > > OTOH, ord() could probably be extended to also work on Unicode objects. > Fine. So I'll drop the uniord() API and extend ord() instead. Hmm, then wouldn't it be more logical to drop unichr() too, but add an optional parameter to chr() to specify … [View More]

2 1

Re: [Python-Dev] Internationalization Toolkit
by Andy Robinson Nov. 11, 1999

Nov. 11, 1999

> I say axe it and say "UTF-8" is the fixed, default > encoding. If you want > something else, then do that explicitly. > Let me tell you why you would want to have an encoding which can be set: (1) sday I am on a Japanese Windows box, I have a string called 'address' and I do 'print address'. If I see utf8, I see garbage. If I see Shift-JIS, I see the correct Japanese address. At this point in time, utf8 is an interchange format but 99% of the world's data is in various native … [View More]

3 2

Re: [Python-Dev] Internationalization Toolkit
by Andy Robinson Nov. 11, 1999

Nov. 11, 1999

> You almost convinced me there, but I think this can > still be done > without changing the default encoding: simply reopen > stdout with a > different encoding. This is how Java does it. I/O > streams with an > encoding specified at open() are a very powerful > feature. You can > hide this in your $PYTHONSTARTUP. Good point, I'm happy with this. Make sure we specify it in the docs as the right way to do it. In an IDE, we'd have an Options screen somewhere for … [View More]

1 0

Re: [Python-Dev] Internationalization Toolkit (fwd)
by Greg Stein Nov. 11, 1999

Nov. 11, 1999

Andy originally sent this just to me... I replied in kind, but saw that he sent another copy to python-dev. Sending my reply there... ---------- Forwarded message ---------- Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST) From: Greg Stein <gstein(a)lyra.org> To: andy(a)robanal.demon.co.uk Subject: Re: [Python-Dev] Internationalization Toolkit [ note: you sent direct to me; replying in kind in case that was your intent ] On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote: >... > … [View More]

1 0

Re: [Python-Dev] default encodings (was: Internationalization Toolkit)
by Andy Robinson Nov. 11, 1999

Nov. 11, 1999

> > What about just explaining the rationale for the > default-less point of > view to whoever is in charge of this at HP and see > why they came up with > their rationale in the first place? They might have > a good reason, or > they might be willing to change said requirement. > > --david For that matter (I came into this a bit late), is there a statement somewhere of what HP actually want to do? - Andy ===== Andy Robinson Robinson Analytics Ltd. ---------… [View More]

2 1