Novice: replacing strings with unicode variables in a list

Wed Dec 6 05:22:44 EST 2006

aine_canby at yahoo.com wrote:
> Hi,
>
> I'm totally new to Python so please bear with me.
>
> Data is entered into my program using the following code -
>
> str = raw_input(command)
> words = str.split()
>
> for word in words:
>   word = unicode(word,'latin-1')
>   word.encode('utf8')

The above statement produces a UTF-8 encoded string and then throws it
away. It does not update "word". To retain the UTF-8 string, you would
have to write word = word.encode('utf8'), and even that only rebinds the
loop variable; it won't update the original list.
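
To make the difference concrete, here is a minimal sketch. It is my own
illustration, not your code: the sample bytes and the name "unchanged"
are assumptions. It is written to run unchanged on Python 2 and Python 3,
so it uses a b"..." literal and .decode('latin-1') where your code uses
unicode(word, 'latin-1').

```python
# Assumed sample input: latin-1 encoded bytes, split into words.
words = b"\x94 foo bar".split()

# Rebinding the loop variable: the UTF-8 result is kept in "word",
# but the list itself is never touched.
for word in words:
    word = word.decode('latin-1').encode('utf8')
unchanged = list(words)  # snapshot: still the original latin-1 bytes

# Assigning through the index is what actually updates the list.
for i, word in enumerate(words):
    words[i] = word.decode('latin-1').encode('utf8')

print(unchanged)
print(words)
```

Building a new list, as in the list comprehension at the end of this
message, is usually the tidier choice; the enumerate() loop is only
needed if other code holds a reference to the same list.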

*** missing source code line(s) here ***

>
> This gives an error:

*** missing traceback lines here ***

>   File "C:\Python25\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x94' in
> position 0
> : character maps to <undefined>

No, it doesn't. The code you showed cannot produce that error; you must
have added "print word" to get it.
*Please*, when you are asking a question, copy/paste (1) the exact
source code that you ran and (2) the exact traceback that you got.

>
> but the following works.

What do you mean by "works"? It may not have triggered an error, but on
the other hand it doesn't do anything useful.

>
> str = raw_input(command)
> words = str.split()
>
> for word in words:
>   uni = u""

The above line is pointless. Removing it will have no effect.

>   uni = unicode(word,'latin-1')
>   uni.encode('utf8')

Same problem as above -- utf8 string is produced and then thrown away.

>
> so the problem is that I want to replace my list with unicode variables.
> Or maybe I should create a new list.
>
> I also tried this:
>
> for word in words[:]:
>   word = u""
>   word = unicode(word,'latin-1')

You got the error on the above statement because you are trying
(pointlessly) to decode the value u"", which is already unicode.
Decoding means converting from some encoding to unicode.
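
A small sketch of that failure, under my own assumptions: the names
"text_type", "already_text", "outcome" and "decoded" are made up for
illustration, and the try/except at the top exists only so the same
lines run on Python 2 (unicode) and Python 3 (where str plays that role).

```python
try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3: str is the unicode type

already_text = u""        # what 'word = u""' leaves in word

try:
    # Asks to decode something that is already decoded text.
    text_type(already_text, 'latin-1')
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# Decoding only makes sense on encoded bytes.
decoded = b"\x94".decode('latin-1')

print(outcome)
```

The exact wording of the TypeError message differs between Python
versions, so the sketch checks only the exception type.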

>   word.encode('utf8')

Again, utf8 straight down the gurgler.

>   print word

This (if executed) will try to print the unicode version, and die [as
in the 1st example] while encoding the unicode as cp850, which is the
encoding of your Windows command console.
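
You can reproduce the console failure without printing anything, by
encoding to cp850 (the codec named in your traceback) by hand. This is
only a sketch; the variable names are mine, and the 'replace' error
handler is shown as one possible workaround, not as a recommendation.

```python
# The unicode character U+0094, obtained by decoding latin-1 bytes.
text = b"\x94".decode('latin-1')

# cp850 has no mapping for U+0094, so a strict encode fails,
# which is what "print" runs into on a cp850 console.
try:
    text.encode('cp850')
    console_result = "encoded"
except UnicodeEncodeError:
    console_result = "UnicodeEncodeError"

# UTF-8 can represent every unicode character, so this succeeds.
utf8_bytes = text.encode('utf8')

# An error handler substitutes "?" instead of raising.
replaced = text.encode('cp850', 'replace')

print(console_result)
```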

>
> but got TypeError: decoding Unicode is not supported.
>
> What should I be doing?

(1) Reading the Unicode howto: http://www.amk.ca/python/howto/

(2) Writing some code like this:

| >>> strg = "\x94 foo bar zot"
| >>> words = strg.split()
| >>> words
| ['\x94', 'foo', 'bar', 'zot']
| >>> utf8words = [unicode(word, 'latin1').encode('utf8') for word in words]
| >>> utf8words
| ['\xc2\x94', 'foo', 'bar', 'zot']
| >>>

HTH,
John