The problem with "s" and "s#" is that they're already semantically
overloaded, and will become more so with support for multiple charsets.
Some modules use "s#" when they mean "give me a pointer to an area of memory
and its length". Writing to binary files is an example of this.
Some modules use it to mean "give me a pointer to a string". Writing to a text
file is (probably) an example of this.
Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This
is the case if we're going to actually look at the contents (think of
string.upper() and such).
I think that the only real solution is to define what "s" means, come up with
new getarg-formats for the other two use cases and convert all modules to use
the new standard. It'll still cause grief to extension modules that aren't
part of the core, but at least the problem will go away after a while.
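To make the three cases concrete, here is a sketch from the Python
side - three calls that all funnel through the same "s#" at the C
level today, even though the extension author means something
different each time (variable names are mine, purely illustrative):

import string
data = '\000\001\002'                  # arbitrary bytes
text = 'hello, world'
open('dump.bin', 'wb').write(data)     # case 1: raw memory plus a length
open('notes.txt', 'w').write(text)     # case 2: a string; bytes never inspected
upper = string.upper(text)             # case 3: 8-bit text we actually examine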
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen(a)oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
--- "Da Silva, Mike" <Mike.Da.Silva(a)uk.fid-intl.com>
wrote:
> As I see it, the relative pros and cons of UTF-8
> versus UTF-16 for use as an
> internal string representation are:
> [snip]
> Regards,
> Mike da Silva
>
Note that by going with UTF-16, we get both. We will
certainly have a codec for UTF-8, just as we will for
ISO-Latin-1, Shift-JIS or whatever. And a perfectly
ordinary Python string is a great place to hold UTF-8;
you can look at it and use most of the ordinary string
algorithms on it.
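For example (a sketch, assuming the unicode() constructor from the
current proposal):

import string
utf8data = open('page.html', 'rb').read()   # plain string holding UTF-8
pos = string.find(utf8data, '<title>')      # ordinary string algorithms work
u = unicode(utf8data, 'utf-8')              # decode only when characters matter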
I presume no one is actually advocating dropping
ordinary Python strings, or the ability to do
rawdata = open('myfile.txt', 'rb').read()
without any transformations?
- Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
--- Gordon McMillan <gmcm(a)hypernet.com> wrote:
> [per-thread defaults]
>
> C'mon guys, hasn't anyone ever played consultant before? The
> idea is obviously brain-dead. OTOH, they asked for it
> specifically, meaning they have some assumptions about how they
> think they're going to use it. If you give them what they ask
> for, you'll only have to fix it when they realize there are
> other ways of doing things that don't work with per-thread
> defaults. So, you find out why they think it's a good thing;
> you make it easy for them to code this way (without actually
> using per-thread defaults) and you don't make a fuss about it.
> More than likely, they won't either.
>
>
I wrote directly to ask them exactly this last night.
Let's forget the per-thread thing until we get an
answer.
- Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
In general, I like this proposal a lot, but I think it
only covers half the story. How we actually build the
encoder/decoder for each encoding is a very big issue.
Thoughts on this below.
First, a little nit:
> u = u'<utf-8 encoded Python string>'
I don't like using funny prime characters - why not an
explicit function like "utf8()"?
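i.e. (just an illustration - utf8() is my hypothetical spelling,
not anything proposed):

u = utf8('<utf-8 encoded Python string>')   # explicit and easy to grep for
u = u'<utf-8 encoded Python string>'        # versus the proposed literal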
On to the important stuff:
> unicodec.register(<encname>,<encoder>,<decoder>
>                   [,<stream_encoder>, <stream_decoder>])
>
> This registers the codecs under the given encoding name in the
> module global dictionary unicodec.codecs. Stream codecs are
> optional: the unicodec module will provide appropriate wrappers
> around <encoder> and <decoder> if not given.
I would MUCH prefer a single 'Encoding' class or type
to wrap up these things, rather than up to four
disconnected objects/functions. Essentially it would
be an interface standard and would offer methods to do
the four things above.
There are several reasons for this.
(1) there are quite a lot of things you might want to
do with an encoding object, and we could extend the
interface in future easily. As a minimum, give it the
four methods implied by the above, two of which can be
defaults. But I'd like an encoding to be able to tell
me the set of characters to which it applies; validate
a string; and maybe tell me if it is a subset or
superset of another.
(2) especially with double-byte encodings, they will
need to load up some kind of database on startup and
use this for both encoding and decoding - much better
to share it and encapsulate it inside one object
(3) for some languages, extra functions are wanted. For
Japanese, you need two or three functions to expand half-width
katakana to full-width, convert double-byte English to
single-byte and vice versa. A Japanese encoding object would be
a handy place to put this knowledge.
(4) In the real world you get many encodings which are
subtle variations of the same thing, plus or minus a
few characters. One bit of code might be able to
share the work of several encodings, by setting a few
flags. Certainly true of Japanese.
(5) encoding/decoding algorithms can be program or
data or (very often) a bit of both. We have not yet
discussed where to keep all the mapping tables, but if
data is involved it should be hidden in an object.
(6) See my comments on a state machine for doing the
encodings. If this is done well, we might have two different
standard objects which conform to the Encoding interface (a
really light one for single-byte encodings, and a bigger one
for multi-byte), and everything else could be data driven.
(7) Easy to grow - encodings can be prototyped and proven in
Python, ported to C if needed or when ready.
In summary, firm up the concept of an Encoding object
and give it room to grow - that's the key to
real-world usefulness. If people feel the same way
I'll have a go at an interface for that, and try to show
how it would have simplified specific problems I have
faced.
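To show the shape I have in mind, here is a rough sketch - the
method names are mine and purely illustrative, not a worked-out
design:

class Encoding:
    # One instance per supported encoding; shared tables live here.
    def encode(self, u):
        # return an 8-bit string holding u in this encoding
        raise NotImplementedError
    def decode(self, s):
        # return a Unicode string for the 8-bit string s
        raise NotImplementedError
    def encodeStream(self, stream):
        # return a writable wrapper that encodes on the fly;
        # a default implementation can be built from encode()
        raise NotImplementedError
    def decodeStream(self, stream):
        raise NotImplementedError
    def validate(self, s):
        # true if s is a legal byte sequence in this encoding
        raise NotImplementedError
    def characterSet(self):
        # the set of characters this encoding can represent
        raise NotImplementedError

A single-byte implementation of this could be entirely table
driven; a Japanese one would carry the extra helper methods
mentioned above.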
We also need to think about where encoding info will
live. You cannot avoid mapping tables, although you
can hide them inside code modules or pickled objects
if you want. Should there be a standard
"..\Python\Enc" directory?
And we're going to need some kind of testing and
certification procedure when adding new encodings.
This stuff has to be right.
Guido asked about TypedString. This can probably be
done on top of the built-in stuff - it is just a
convenience which would clarify intent, reduce lines
of code and prevent people shooting themselves in the
foot when juggling a lot of strings in different
(non-Unicode) encodings. I can do a Python module to
implement that on top of whatever is built.
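A sketch of the flavour I mean (class name and behaviour purely
illustrative):

class TypedString:
    def __init__(self, data, encoding):
        self.data = data              # 8-bit string
        self.encoding = encoding      # e.g. 'shift-jis'
    def __add__(self, other):
        if other.encoding != self.encoding:
            raise ValueError('mixed encodings')
        return TypedString(self.data + other.data, self.encoding)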
Regards,
Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
> See my other post on the subject...
>
> Note that if we make UTF-8 the standard encoding, nearly all
> special Latin-1 characters will produce UTF-8 errors on input
> and unreadable garbage on output. That will probably be
> unacceptable in Europe. To remedy this, one would *always*
> have to use u.encode('latin-1') to get readable output for
> Latin-1 strings represented in Unicode.
You beat me to it - a colleague and I were just
discussing this verbally. Specifically we Brits will
get annoyed as soon as we read in a text file with
pound (sterling) signs.
We concluded that the only reasonable default (if you
have one at all) is pure ASCII. At least that way I
will get a clear and intelligible warning when I load
in such a file, and will remember to specify
ISO-Latin-1.
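For example (a sketch, assuming the proposed unicode() conversion
with an ASCII default):

s = open('prices.txt').read()     # contains a pound sign, '\243'
u = unicode(s)                    # ASCII default: fails loudly - good
u = unicode(s, 'latin-1')         # what I actually meant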
- Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
> > [MAL, on Unicode chr() and ord()]
> > > ...
> > > Because unichr() will always have to return Unicode objects. You don't
> > > want chr(i) to return Unicode for i>255 and strings for i<256.
> > > OTOH, ord() could probably be extended to also work on Unicode objects.
> Fine. So I'll drop the uniord() API and extend ord() instead.
Hmm, then wouldn't it be more logical to drop unichr() too, but add an
optional parameter to chr() to specify what sort of a string you want? The
type-object of a unicode string comes to mind...
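Something like this, purely hypothetical:

c = chr(65)                   # ordinary 8-bit string, as today
u = chr(0x3042, type(u''))    # same function, asked to produce Unicode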
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen(a)oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
> I say axe it and say "UTF-8" is the fixed, default encoding.
> If you want something else, then do that explicitly.
>
Let me tell you why you would want to have an encoding
which can be set:
(1) Say I am on a Japanese Windows box, I have a string called
'address' and I do 'print address'. If I see UTF-8, I see
garbage. If I see Shift-JIS, I see the correct Japanese
address. At this point in time, UTF-8 is an interchange format
but 99% of the world's data is in various native encodings.
Analogous problems occur on input.
(2) I'm using htmlgen, which 'prints' objects to standard
output. My web site is supposed to be encoded in Shift-JIS (or
EUC, or Big5 for Taiwan, etc.). Yes, browsers CAN detect and
display UTF-8, but you just don't find UTF-8 sites in the real
world - and most users just don't know about the encoding menu,
and will get pissed off if they have to reach for it.
Ditto for streaming output in some protocol.
Java solves this (and we could too by hacking stdout)
using Writer classes which are created as wrappers
around an output stream and can take an encoding, but
you lose the flexibility to 'just print'.
I think being able to change encoding would be useful.
What I do not want is to auto-detect it from the
operating system when Python boots - that would be a
portability nightmare.
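By 'hacking stdout' I mean something like this sketch (assuming
the proposed u.encode() method; a real version would pass plain
8-bit strings through untouched):

import sys

class EncodedStdout:
    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding
    def write(self, u):
        # encode Unicode objects on the way out
        self.stream.write(u.encode(self.encoding))

sys.stdout = EncodedStdout(sys.stdout, 'shift-jis')
# now 'print address' comes out readable on a Japanese box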
Regards,
Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
> You almost convinced me there, but I think this can still be
> done without changing the default encoding: simply reopen
> stdout with a different encoding. This is how Java does it.
> I/O streams with an encoding specified at open() are a very
> powerful feature. You can hide this in your $PYTHONSTARTUP.
Good point, I'm happy with this. Make sure we specify
it in the docs as the right way to do it. In an IDE,
we'd have an Options screen somewhere for the output
encoding.
What the Java code I have seen does is to open a raw
file and construct wrappers (InputStreamReader,
OutputStreamWriter) around it to do an encoding
conversion. This kind of obfuscates what is going on
- Python just needs the extra argument.
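i.e. instead of the two-step wrapper dance, something like this
(a hypothetical signature, not an agreed one - today's third
argument to open() is bufsize):

f = open('page.html', 'w', 'shift-jis')   # hypothetical encoding argument
f.write(u)                                # encoded on the way out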
- Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
Andy originally sent this just to me... I replied in kind, but saw that he
sent another copy to python-dev. Sending my reply there...
---------- Forwarded message ----------
Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST)
From: Greg Stein <gstein(a)lyra.org>
To: andy(a)robanal.demon.co.uk
Subject: Re: [Python-Dev] Internationalization Toolkit
[ note: you sent direct to me; replying in kind in case that was your
intent ]
On Wed, 10 Nov 1999, Andy Robinson wrote:
>...
> Let me tell you why you would want to have an encoding
> which can be set:
>...snip: two examples of how "print" fails...
Neither of those examples is a solid reason for having a default
encoding that can be changed. Both can easily be handled at the
Python level by calling an encoding function before printing.
You're asking for convenience, *not* providing a reason.
> Java solves this (and we could too) using Writer
> classes which are created as wrappers around an output
> stream and can take an encoding, but you lose the
> flexibility to just print.
Not flexibility: convenience. You can certainly do:
print encode(u,'Shift-JIS')
> I think being able to change encoding would be useful.
> What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare.
Useful, but not a requirement.
Keep the interpreter simple, understandable, and predictable. A module
that changes the default over to 'utf-8' because it is interacting with a
network object is going to screw up your app if you're relying on an
encoding of 'shift-jis' to be present.
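To spell out that failure mode (a sketch; set_default() is a
stand-in name for whatever the setter would be, and unicodec is
the module from the proposal):

import unicodec

# deep inside some network library you imported:
unicodec.set_default('utf-8')     # hypothetical setter

# meanwhile, back in your application:
data = u.encode()    # silently UTF-8 now, not the Shift-JIS you assumed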
Cheers,
-g
--
Greg Stein, http://www.lyra.org/
> What about just explaining the rationale for the default-less
> point of view to whoever is in charge of this at HP and see
> why they came up with their rationale in the first place? They
> might have a good reason, or they might be willing to change
> said requirement.
>
> --david
For that matter (I came into this a bit late), is
there a statement somewhere of what HP actually want
to do?
- Andy
=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.