Trouble with unicode
Charlie Clark
charlie at begeistert.org
Tue May 15 04:03:54 EDT 2001
>>they look about 3 characters long but are only 1 really, I already have
>>experience converting Unix characters over.
>
>Sounds like UTF-8. If it is, you can just replace 'latin-1' below with
>'utf-8' :-)
UnicodeError: UTF-8 decoding error: invalid data
Hmm, how can I check which encoding it actually is?
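There is no fully reliable way to detect an encoding from the bytes alone, but a rough check is to try a few candidates in order and see which one decodes cleanly. The following is only a sketch under assumptions: it is written for a current Python 3, the function name guess_encoding and the candidate list are made up for illustration, and latin-1 has to come last because it accepts every possible byte.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes data without error.

    Hypothetical helper: a successful decode does not prove the guess is
    right, it only rules out the encodings that failed.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        return name
    return None


# A lone 0xfc byte is not valid UTF-8, so this sample falls through to latin-1.
sample = b"D\xfcsseldorf"
print(guess_encoding(sample))
```

If the text really were UTF-8, every special character would show up as a two-byte (or longer) sequence, and the utf-8 candidate would succeed before latin-1 is tried.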
>> But I'm no closer, am I? I don't quite understand what the
>> codecs module is
>> and how it works.
>
>You're closer :-)
Thanx. There's probably a good reason for all the smoke and mirrors, but I
don't see why I can't do simple encodes and decodes, and it's going to take a
while to work out how this lookup works. I would call it cryptic, even
though I realise it's being very OO.
>
>OK, it looks like you are starting with a string containing Latin-1
>characters. If I understand correctly, you want to remove the characters
>that are not in the ASCII set (i.e. > 127). There are two ways to do that:
>
>1. Fancy (change 'latin-1' to the actual encoding):
>
>>>> from codecs import lookup
>>>> fromLatin1 = lookup( 'latin-1' )[1]
>>>> toASCII = lookup( 'ASCII' )[0]
>>>> asLatin1, dummy = fromLatin1( '\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf' )
>>>> toASCII( asLatin1, 'replace' )
>('?, ?, ?, ?, ?, ?, ?', 19)
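The [1] and [0] indexes in the session above pick the decoder and the encoder out of the tuple that lookup() returns. As a minimal sketch of the same steps with the named attributes instead, assuming a current Python 3 (where byte strings are written b'...' and lookup() returns a CodecInfo object); the variable names are only illustrative:

```python
import codecs

# lookup() returns a CodecInfo; .decode and .encode are the same
# functions that the [1] and [0] indexes select.
latin1 = codecs.lookup("latin-1")
ascii_codec = codecs.lookup("ascii")

raw = b"\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"

# Each codec function returns a pair: (result, amount of input consumed).
text, consumed = latin1.decode(raw)
ascii_bytes, length = ascii_codec.encode(text, "replace")

print(text)
print(ascii_bytes)
```

The 'replace' argument is what turns every character outside ASCII into a question mark; the decoded text itself still holds the real characters.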
It works. But I don't want a load of question marks! I want the special
characters. I particularly want to be able to replace them all at once.
This is how I've previously done this:
def unix_to_unicode(text):
    # Map each special byte to the character it stands for.
    special = {"\xc4": "Ä", "\xe4": "ä", "\xd6": "Ö", "\xf6": "ö",
               "\xdc": "Ü", "\xfc": "ü", "\xdf": "ß"}
    for key in special.keys():
        text = text.replace(key, special[key])
    return text
This works fine for single-character entities, so I will be able to work
around the problem.
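The table-driven replacement above can probably be collapsed into one decode and one encode, which converts every character at once and produces no question marks. This is a sketch only, assuming the incoming bytes are Latin-1 and the target is UTF-8 (neither is confirmed in this thread), written for a current Python 3; latin1_to_utf8 is a made-up name.

```python
def latin1_to_utf8(data):
    """Re-encode Latin-1 bytes as UTF-8 bytes, keeping every character.

    Assumes the input really is Latin-1; any other single-byte encoding
    would decode without error but give the wrong characters.
    """
    return data.decode("latin-1").encode("utf-8")


raw = b"D\xfcsseldorf, Stra\xdfe"
converted = latin1_to_utf8(raw)
print(converted.decode("utf-8"))
```

Because the codec covers the whole Latin-1 range, there is no need to list the umlauts by hand, and characters missing from the dictionary are handled as well.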
Maybe I should provide the context for my work. I've written a script which
reads orders which come via e-mail, and writes the significant data to file
attributes generating a pseudo database in the file system. This is all BeOS
specific but I'm using Python for it all 'cos it's the only way I'll ever
understand what I'm doing!
Thanx again for your help!
Charlie
--
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
http://www.begeistert.org