create a string of variable length

Benjamin Kaplan benjamin.kaplan at case.edu
Mon Feb 1 01:54:17 CET 2010


On Sun, Jan 31, 2010 at 5:12 PM, Tracubik <affdfsdfdsfsd at b.com> wrote:
> On Sun, 31 Jan 2010 13:46:16 +0100, Günther Dietrich wrote:
>
>> Maybe you might solve this if you decode your string to unicode.
>> Example:
>>
>> |>>> euro = "€"
>> |>>> len(euro)
>> |3
>> |>>> u_euro = euro.decode('utf_8')
>> |>>> len(u_euro)
>> |1
>>
>> Adapt the encoding ('utf_8' in my example) to whatever you use.
>>
>> Or create the unicode string directly:
>>
>> |>>> u_euro = u'€'
>> |>>> len(u_euro)
>> |1
>>
>>
>>
>> Best regards,
>>
>> Günther
>
> Thank you, your two solutions are really interesting.
> Is there a way to set Unicode encoding by default for my Python
> scripts?
> I've tried inserting
> # -*- coding: utf-8 -*-
>
> at the beginning of my script, but it doesn't solve the problem.


First of all, if you haven't read this before, please do. It will make
this much clearer.
http://www.joelonsoftware.com/articles/Unicode.html

To reiterate: UTF-8 IS NOT UNICODE!!!!

In Python 2, a plain literal such as '...' is a byte string. It is
read as a sequence of bytes and interpreted as a sequence of bytes.
When Python encounters the sequence 0x27 0xe2 0x82 0xac 0x27 in the
code (the UTF-8 bytes for '€'), it interprets it as 3 bytes between
the two quotes. It doesn't care about characters or anything like
that. A literal with the u prefix, u'...', is a Unicode string. Python
will attempt to convert the sequence of bytes into a sequence of
characters. Any encoding can be used for that conversion: cp1252,
utf-8, MacRoman, ISO-8859-15. UTF-8 isn't special; it's just one of
the few encodings capable of storing all of the possible Unicode
characters.
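
A small sketch of the difference between bytes and characters. The
variable names are only illustrative, and the bytes are written with
escapes so the example does not depend on how this message or your
editor is encoded. It should run unchanged on Python 2.6+ and 3.3+.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; nothing here comes from the original script.

raw = b'\xe2\x82\xac'    # byte string: the three UTF-8 bytes for the euro sign

as_utf8 = raw.decode('utf-8')      # decoded as UTF-8: one character, U+20AC
as_cp1252 = raw.decode('cp1252')   # decoded as cp1252: three unrelated characters

print(len(raw))                # 3 (bytes)
print(len(as_utf8))            # 1 (character)
print(len(as_cp1252))          # 3 (characters, but not the ones you wanted)
print(as_utf8 == u'\u20ac')    # True

# The same character is stored as different bytes by different encodings.
print(len(as_utf8.encode('utf-8')))        # 3
print(len(as_utf8.encode('cp1252')))       # 1
print(len(as_utf8.encode('iso-8859-15')))  # 1
```

The bytes never change; only the encoding used to interpret them does,
which is why you have to know the encoding before you decode.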

What the line at the top says is that the file should be read using
UTF-8. Byte strings are still just sequences of bytes; the declaration
doesn't affect them. But any Unicode literal in the file will be
decoded using UTF-8. If Python looks at the above sequence of bytes
as a Unicode string, it views the 3 bytes as a single character. When
you ask for its length, it returns the number of characters.
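
To see the effect of the coding line in isolation, here is a rough
sketch that compiles the same bytes under two different declarations.
literal_length is a hypothetical helper written for this message, and
the "file" is simulated as a byte string rather than read from disk.

```python
# -*- coding: utf-8 -*-
# literal_length is a hypothetical helper, not part of any library.

def literal_length(coding):
    """Compile a tiny simulated source file that declares `coding`
    and return the length of the Unicode literal it contains."""
    header = ("# -*- coding: " + coding + " -*-\n").encode('ascii')
    # The literal's content is the raw bytes 0xe2 0x82 0xac.
    body = b"s = u'\xe2\x82\xac'\n"
    namespace = {}
    exec(compile(header + body, '<demo>', 'exec'), namespace)
    return len(namespace['s'])

utf8_len = literal_length('utf-8')      # bytes read as one character
latin1_len = literal_length('latin-1')  # each byte read as its own character

print(utf8_len)    # 1
print(latin1_len)  # 3
```

Same bytes in the file, different declaration, different string. The
declaration only helps if it matches the encoding your editor actually
saves the file in.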

Solution to your problem: in addition to keeping the #-*- coding ...
line, go with Günther's advice and use Unicode strings.
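
As a sketch of what that looks like in practice, assuming (from the
subject line) that the goal is a string made of N copies of a
character: repeat_char is a made-up name, and the count of 5 is
arbitrary.

```python
# -*- coding: utf-8 -*-
# repeat_char is a hypothetical helper; the original script was not posted.

def repeat_char(char, count):
    """Return a string made of `count` copies of `char`."""
    return char * count

euros = repeat_char(u'\u20ac', 5)   # Unicode string: 5 characters
euro_bytes = b'\xe2\x82\xac' * 5    # byte string: 5 * 3 = 15 bytes

print(len(euros))                   # 5
print(len(euro_bytes))              # 15
print(len(euros.encode('utf-8')))   # 15, encode only when writing out
```

Keep the text as Unicode while you measure or pad it, and encode to
bytes only at the point where you write it out.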