Python3 - encoding issues
DreiJane
joost at h-labahn.de
Sat Nov 28 21:32:09 EST 2009
Hello,
first I must beg the pardon of those of you whose mailboxes were
flooded by my last announcement of depikt. I get no emails from
this list myself, and when I made my corrections and posted each of the
slightly improved versions, I wasn't aware of the extra emails that
produces. Sorry!
I read here recently that some regard Python 3 as worse at encoding
issues than earlier versions. For me, a German, quite the contrary is
true. The automatic conversion before 3, done without raising an
Exception, has caused pain upon pain over the last years. Only a few
weeks ago pygtk suddenly returned utf-8 instead of unicode, and my
software sent out a lot of garbled, automatically written emails
before I saw the mess. Python 3 would have raised Exceptions. However,
the translation of my software to 3 has only just begun.
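A minimal sketch of that difference, assuming a library hands back utf-8 bytes where text was expected (the variable names and the sample text are made up for illustration):

```python
# Python 3: str and bytes are never converted into each other implicitly.
text = "Grüße"                    # str (Unicode text)
data = text.encode("utf-8")       # bytes, as a library might return them

try:
    mixed = "Betreff: " + data    # str + bytes
except TypeError as exc:
    error = exc                   # Python 3 refuses instead of guessing
else:
    error = None

# The conversion has to be spelled out, with the encoding named.
subject = "Betreff: " + data.decode("utf-8")
```

Under Python 2 the same concatenation would have attempted a silent conversion, which is the behaviour described above.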
Now there is a concept of two separate worlds, and I have decided to
use bytes for my software. Bytes are the representation that output
needs anyway, and with depikt and a modified apsw (file reads give
bytes anyway) or other database APIs (internally they all understand
utf-8) I can get utf-8 for all input too.
This means that I do not have the standard string methods, but
substitutes are easily made. Not as a subclass of bytes, though: that
wouldn't have the b"...." initialization. Thus only in the form of
functions. Here are some of my utools:
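u0 = "".encode('utf-8')

def u(s):
    # numbers and classes are turned into str first, then encoded
    if type(s) in (int, float, type): s = str(s)
    if type(s) == str: return s.encode("utf-8")
    if type(s) == bytes: # we keep the two worlds cleanly separated
        raise TypeError(b"argument is bytes already")
    raise TypeError(b"Bad argument for utf-encoding")

def u_startswith(s, test):
    # compare the leading slice; always returns True or False
    return s[:len(test)] == test

def u_endswith(s, test):
    # s[-0:] would be the whole sequence, so an empty test is handled first
    return len(test) == 0 or s[-len(test):] == test

def u_split(s, splitter):
    # empty pieces in front of a splitter are skipped
    if not splitter: raise ValueError("empty splitter")
    ret = []
    while s and splitter in s:
        if u_startswith(s, splitter):
            s = s[len(splitter):]; continue
        i = s.index(splitter)
        ret.append(s[:i])            # keep the piece before the splitter
        s = s[i + len(splitter):]    # go on behind the splitter
    return ret + [s]

def u_join(joiner, l):
    # an empty list gives an empty sequence of the joiner's type
    if len(l) == 0: return joiner[:0]
    ret = l[0]
    for item in l[1:]: ret = ret + joiner + item
    return ret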
(not all with the standard signatures). Writing them is trivial. Note
u0: unfortunately b"" doesn't work at all as expected, as I had to learn
the hard way.
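For comparison, a short sketch using the methods of the same names that bytes objects provide in current Python 3 releases; they take bytes arguments and return bytes. The sample value is only an illustration:

```python
# Built-in bytes methods, called with bytes arguments throughout.
line = "Grüße, Welt".encode("utf-8")

starts = line.startswith("Gr".encode("utf-8"))   # True
ends = line.endswith(b"Welt")                    # True
parts = line.split(b", ")                        # list of two bytes objects
joined = b", ".join(parts)                       # the original line again
```

Passing a str to any of these raises a TypeError, so the two worlds stay separated here as well.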
Looking more closely at these functions, one sees that they only use the
sequence protocol. "index" is in the sequence protocol too now; the
library reference still has to be updated there. Thus all of these and
many more string methods could be moved into the sequence protocol
without much work, and then nobody would have to write all this. This
doesn't only affect string-like objects: split and join for lists
could open interesting possibilities for list representations of trees,
for example.
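What split and join for lists might look like can be sketched as follows. The names seq_split and seq_join are hypothetical, and the use of None as a boundary marker between sibling subtrees is only an assumption for the example, not an existing API:

```python
def seq_split(seq, sep):
    """Split seq at every element equal to sep; returns a list of slices."""
    pieces, start = [], 0
    for i, item in enumerate(seq):
        if item == sep:
            pieces.append(seq[start:i])
            start = i + 1
    pieces.append(seq[start:])
    return pieces

def seq_join(sep, pieces):
    """Concatenate pieces into one list, with the element sep in between."""
    result = []
    for n, piece in enumerate(pieces):
        if n:
            result.append(sep)
        result.extend(piece)
    return result

# A flat list in which None marks the boundary between sibling subtrees.
flat = ["a", "b", None, "c", None, "d", "e"]
children = seq_split(flat, None)      # [["a", "b"], ["c"], ["d", "e"]]
restored = seq_join(None, children)   # equal to flat again
```

Both functions use only iteration, slicing and comparison, so nothing in them is specific to strings.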
Does anybody want to make a PEP from this (I won't do so)?
Joost Behrends