Novice: replacing strings with unicode variables in a list
John Machin
sjmachin at lexicon.net
Wed Dec 6 05:22:44 EST 2006
aine_canby at yahoo.com wrote:
> Hi,
>
> I'm totally new to Python so please bear with me.
>
> Data is entered into my program using the following code -
>
> str = raw_input(command)
> words = str.split()
>
> for word in words:
>     word = unicode(word,'latin-1')
>     word.encode('utf8')
The above statement produces a string in utf8 and then throws it away.
It does not update "word". To retain the utf8 string, you would have to
do word = word.encode('utf8'), and in any case that only rebinds the
name "word"; it won't update the original list.
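
A minimal sketch of both points, written so that it runs on Python 2.6+ as well as Python 3 (the byte-string literal and `.decode()` stand in for `raw_input()` and `unicode(word, 'latin-1')`; the sample input is made up):

```python
# Hypothetical input bytes, standing in for what raw_input() returned.
words = b"\x94 foo bar".split()

for word in words:
    word = word.decode("latin-1")   # rebinds the loop name only
    word.encode("utf-8")            # result is built, then discarded

unchanged = list(words)             # the list still holds the original bytes

# Either build a new list ...
utf8words = [w.decode("latin-1").encode("utf-8") for w in words]

# ... or assign back into the existing list by index.
for i, w in enumerate(words):
    words[i] = w.decode("latin-1").encode("utf-8")
```

Building a new list is usually the simpler choice; assigning by index only matters if other code holds a reference to the same list object.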
*** missing source code line(s) here ***
>
> This gives an error:
*** missing traceback lines here ***
> File "C:\Python25\lib\encodings\cp850.py", line 12, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x94' in
> position 0
> : character maps to <undefined>
No, it doesn't. You must have put "print word" to get the error that
you did.
*Please* when you are asking a question, copy/paste (1) the exact
source code that you ran (2) the exact traceback that you got.
>
> but the following works.
What do you mean by "works"? It may not have triggered an error, but on
the other hand it doesn't do anything useful.
>
> str = raw_input(command)
> words = str.split()
>
> for word in words:
>     uni = u""
Above line is pointless. Removing it will have no effect
>     uni = unicode(word,'latin-1')
>     uni.encode('utf8')
Same problem as above -- utf8 string is produced and then thrown away.
>
> so the problem is that I want to replace my list with unicode variables.
> Or maybe I should create a new list.
>
> I also tried this:
>
> for word in words[:]:
>     word = u""
>     word = unicode(word,'latin-1')
You got the error on the above statement because you are trying
(pointlessly) to decode the value u"". Decoding means to convert from
some encoding to unicode.
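
A small sketch that reproduces this kind of TypeError, assuming nothing beyond the built-in text type. The `text_type` name is only a helper for this example so that the same lines run on Python 2 (`unicode`) and Python 3 (`str`); the exact wording of the message differs between the two.

```python
# Pick the built-in text type: "unicode" on Python 2, "str" on Python 3.
try:
    text_type = unicode
except NameError:
    text_type = str

error = None
try:
    # The value is already text, so there is nothing to decode.
    text_type(u"", "latin-1")
except TypeError as exc:
    error = str(exc)

# Decoding only makes sense for encoded bytes.
decoded = b"\x94".decode("latin-1")
```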
>     word.encode('utf8')
Again, utf8 straight down the gurgler.
>     print word
This (if executed) will try to print the UNICODE version, and die [as
in the 1st example] encoding the unicode in cp850, which is the
encoding for your Windows command console.
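
The failure can be reproduced without a console by encoding explicitly, as in this rough sketch. It assumes cp850, the codec named in the traceback; the "replace" error handler is shown only as one possible workaround for display purposes, not as something recommended in this thread.

```python
text = u"\x94"

try:
    text.encode("cp850")        # roughly what "print" attempts on that console
    failed = False
except UnicodeEncodeError:
    failed = True               # cp850 has no mapping for U+0094

utf8 = text.encode("utf-8")     # UTF-8 does have an encoding for U+0094

# The "replace" handler substitutes "?" instead of raising.
fallback = text.encode("cp850", "replace")
```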
>
> but got TypeError: decoding Unicode is not supported.
>
> What should I be doing?
(1) Reading the Unicode howto: http://www.amk.ca/python/howto/
(2) Writing some code like this:
| >>> strg = "\x94 foo bar zot"
| >>> words = strg.split()
| >>> words
| ['\x94', 'foo', 'bar', 'zot']
| >>> utf8words = [unicode(word, 'latin1').encode('utf8') for word in words]
| >>> utf8words
| ['\xc2\x94', 'foo', 'bar', 'zot']
| >>>
HTH,
John