
It depends on the output you want to have. One option would be

    s = codecs.lookup('unicode-escape')[3](sys.stdout)

Then, s.write(u'\251') prints a string in Python quoting notation.

Unfortunately, print >>s, u'\251' won't work, since print *first* tries to convert the argument to a string, and then prints the string onto the stream.
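For readers on a current Python, the stream-writer approach above can be sketched as follows. This is an adaptation rather than the original snippet: it assumes Python 3, uses an in-memory `io.BytesIO` buffer as a stand-in for `sys.stdout`, and reaches the writer class through `codecs.getwriter()` instead of indexing the lookup result.

```python
import codecs
import io

# Stand-in for sys.stdout so the result can be inspected; a byte stream is
# assumed here because a StreamWriter emits encoded bytes.
buf = io.BytesIO()

# codecs.getwriter() returns the codec's StreamWriter class, the same object
# the message reaches through codecs.lookup('unicode-escape')[3].
writer = codecs.getwriter("unicode-escape")(buf)

# '\251' is an octal escape for U+00A9 (COPYRIGHT SIGN).
writer.write("\251")

# The buffer now holds the character in Python escape notation.
escaped = buf.getvalue()
print(escaped)
```

The writer encodes on every `write()` call, which is why it has to be handed the Unicode string directly rather than something already converted by print.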
On the other hand, it's annoying that I can't create a file-object that takes Unicode strings from "print", and doesn't seem intuitive.
Since you are asking for a hack :-) How about having an additional letter of 'u' in the "mode" attribute of a file object? Then, print would be

    def print(stream, string):
        if type(string) == UnicodeType:
            if 'u' in stream.mode:
                stream.write(string)
                return
        stream.write(str(string))

The Stream readers and writers would then need to have a mode of 'ru' or 'wu', respectively. Any other protocol to signal unicode-awareness in a stream might do as well.

Regards,
Martin

P.S. Is there some function to retrieve the UCN names from ucnhash.c?
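The 'u' mode-flag idea can be tried out with a small runnable sketch. It is only an illustration of the proposed protocol, not anything Python implements: it assumes Python 3, where `str` plays the role of the Unicode type and `bytes` the role of the old string type, and the names `UnicodeAwareWriter` and `print_to` are hypothetical.

```python
import io


class UnicodeAwareWriter:
    """Hypothetical stream that signals Unicode-awareness via 'u' in mode."""

    mode = "wu"

    def __init__(self):
        self.chunks = []

    def write(self, text):
        # Receives the Unicode string untouched.
        self.chunks.append(text)


def print_to(stream, obj, encoding="ascii"):
    """Sketch of the proposed print logic (hypothetical helper)."""
    # Unicode-aware streams get the string object as-is.
    if isinstance(obj, str) and "u" in getattr(stream, "mode", ""):
        stream.write(obj)
        return
    # Everything else is converted first. Escaping unencodable characters
    # is an assumption of this sketch; the original str() call would fail.
    stream.write(str(obj).encode(encoding, "backslashreplace"))


aware = UnicodeAwareWriter()
plain = io.BytesIO()  # has no 'u' in its mode, so it takes the fallback path

print_to(aware, "\251")
print_to(plain, "\251")
```

Here `aware.chunks` ends up holding the original string, while `plain` only ever sees converted bytes.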

Martin von Loewis wrote:
> P.S. Is there some function to retrieve the UCN names from ucnhash.c?

No, there's not even a way to extract those names... a table is there (_Py_UnicodeCharacterName in ucnhash.c), but no access function.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

> P.S. Is there some function to retrieve the UCN names from ucnhash.c?

the "unicodenames" patch (which replaces ucnhash) includes this functionality -- but with a little distance, I think it's better to add it to the unicodedata module.

(it's included in the step 4 patch, soon to be posted to a patch manager near you...)

</F>
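As a point of reference, current versions of Python do expose character names through the unicodedata module, as `unicodedata.name()` and `unicodedata.lookup()`. The short example below assumes Python 3; it shows today's API, not the state of the patch discussed here.

```python
import unicodedata

# Character -> name.
name = unicodedata.name("\u00a9")

# Name -> character (the reverse direction).
char = unicodedata.lookup("COPYRIGHT SIGN")

# name() raises ValueError for unnamed characters unless a default is given.
missing = unicodedata.name("\x00", "<no name>")

print(name, repr(char), missing)
```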

Martin von Loewis wrote:
If you need speed, you'd have to write a C codec for this. And yes: the ucnhash module does import a C API using a PyCObject which you can use to access the static C data table. Don't know if Fredrik's version will also support this. I think a C function as access method would be more generic than the current direct C table access.

If you just need a single encoding, e.g. Latin-1, simply clone the codec (it's coded in unicodeobject.c) and add the XML entity processing.

Unfortunately, reusing the existing codecs is not too efficient: there is no error handling which would permit you to say "encode as far as you can and then return the encoded data plus a position marker in the input stream/data".

Perhaps we should add a new standard error handling scheme "break" which simply stops encoding/decoding whenever an error occurs?! This should then allow reusing existing codecs by processing the input in slices.

-- 
Marc-Andre Lemburg
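The slice-by-slice idea can be sketched in current Python, where `UnicodeEncodeError` carries `start` and `end` attributes that serve as the position marker asked for above. This is a minimal illustration assuming Python 3.5 or later, not the proposed "break" scheme itself; `encode_with_charrefs` is a hypothetical helper name.

```python
def encode_with_charrefs(text, encoding="latin-1"):
    """Encode text with an existing codec, one slice at a time, writing
    unencodable characters as XML numeric character references."""
    out = []
    pos = 0
    while pos < len(text):
        try:
            # Try to encode everything that is left in one go.
            out.append(text[pos:].encode(encoding))
            break
        except UnicodeEncodeError as exc:
            # exc.start/exc.end are offsets into the slice just passed in.
            start = pos + exc.start
            end = pos + exc.end
            # Keep the part that encoded cleanly ...
            out.append(text[pos:start].encode(encoding))
            # ... and replace the offending characters with &#NNN; forms.
            for ch in text[start:end]:
                out.append(b"&#%d;" % ord(ch))
            pos = end
    return b"".join(out)


encoded = encode_with_charrefs("caf\u00e9 \u20ac5")
print(encoded)
```

For this particular use, the standard library now also offers the built-in `xmlcharrefreplace` error handler, which produces the same bytes without a manual loop.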

participants (3)
- Fredrik Lundh
- M.-A. Lemburg
- Martin von Loewis