"Andrew M. Kuchling" wrote:
Here's a strawman codec for doing the \N{NULL} thing. Questions:
0) Is the code below correct?
Some comments below.
1) What the heck would this encoding be called?
Ehm, 'unicode-with-smileys' I guess... after all that's what motivated the thread ;-) Seriously, I'd go with 'unicode-named'. You can then stack it on top of 'unicode-escape' and get the best of both worlds...
2) What does .encode() do? (Right now it escapes \N as \N{BACKSLASH}N.)
.encode() should translate Unicode to a string. Since the named char thing is probably only useful on input, I'd say: don't do anything, except maybe return input.encode('unicode-escape').
3) How can we store all those names? The resulting dictionary makes a 361K .py file; Python dumps core trying to parse it. (Another bug...)
I've made the same experience with the large Unicode mapping tables... the trick is to split the dictionary definition in chunks and then use dict.update() to paste them together again.
4) What do you with the error \N{...... no closing right bracket. Right now it stops at that point, and never advances any farther. Maybe it should assume it's an error if there's no } within the next 200 chars or some similar limit?
I'd suggest to take the upper bound of all Unicode name lengths as limit.
5) Do we need StreamReader/Writer classes, too?
If you plan to have it registered with a codec search function, yes. No big deal though, because you can use the Codec class as basis for them: class StreamWriter(Codec,codecs.StreamWriter): pass class StreamReader(Codec,codecs.StreamReader): pass ### encodings module API def getregentry(): return (Codec().encode,Codec().decode,StreamReader,StreamWriter) Then call drop the scripts into the encodings package dir and it should be useable via unicode(r'\N{SMILEY}','unicode-named') and u":-)".encode('unicode-named').
I've also add a script that parses the names out of the NameList.txt file at ftp://ftp.unicode.org/Public/UNIDATA/.
--amk
namecodec.py: =============
import codecs
#from _namedict import namedict namedict = {'NULL': 0, 'START OF HEADING' : 1, 'BACKSLASH':ord('\\')}
class NameCodec(codecs.Codec): def encode(self,input,errors='strict'): # XXX what should this do? Escape the # sequence \N as '\N{BACKSLASH}N'? return input.replace( '\\N', '\\N{BACKSLASH}N' )
You should return a string on output... input will be a Unicode object and the return value too if you don't add e.g. an .encode('unicode-escape').
def decode(self,input,errors='strict'): output = unicode("") last = 0 index = input.find( u'\\N{' ) while index != -1: output = output + unicode( input[last:index] ) used = index r_bracket = input.find( '}', index) if r_bracket == -1: # No closing bracket; bail out... break
name = input[index + 3 : r_bracket] code = namedict.get( name ) if code is not None: output = output + unichr(code) elif errors == 'strict': raise ValueError, 'Unknown character name %s' % repr(name)
This could also be UnicodeError (its a subclass of ValueError).
elif errors == 'ignore': pass elif errors == 'replace': output = output + unichr( 0xFFFD )
'\uFFFD' would save a call.
last = r_bracket + 1 index = input.find( '\\N{', last) else: # Finally failed gently, no longer finding a \N{... output = output + unicode( input[last:] ) return len(input), output
# Otherwise, we hit the break for an unterminated \N{...} return index, output
Note that .decode() must only return the decoded data. The "bytes read" integer was removed in order to make the Codec APIs compatible with the standard file object APIs.
if __name__ == '__main__': c = NameCodec() for s in [ r'b\lah blah \N{NULL} asdf', r'b\l\N{START OF HEADING}\N{NU' ]: used, s2 = c.decode(s) print repr( s2 )
s3 = c.encode(s) _, s4 = c.decode(s3) print repr(s3) assert s4 == s
print repr( c.decode(r'blah blah \N{NULLsadf} asdf' , errors='replace' )) print repr( c.decode(r'blah blah \N{NULLsadf} asdf' , errors='ignore' ))
makenamelist.py ===============
# Hack to extract character names from NamesList.txt # Output the repr() of the resulting dictionary
import re, sys, string
namedict = {}
while 1: L = sys.stdin.readline() if L == "": break
m = re.match('([0-9a-fA-F]){4}(?:\t(.*)\s*)', L) if m is not None: last_char = int(m.group(1), 16) if m.group(2) is not None: name = string.upper( m.group(2) ) if name not in ['<CONTROL>', '<NOT A CHARACTER>']: namedict[ name ] = last_char # print name, last_char
m = re.match('\t=\s*(.*)\s*(;.*)?', L) if m is not None: name = string.upper( m.group(1) ) names = string.split(name, ',') names = map(string.strip, names) for n in names: namedict[ n ] = last_char # print n, last_char
# XXX and do what with this dictionary? print namedict
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://www.python.org/mailman/listinfo/python-dev
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/