String encoding in Py2.7
Fabien LUCE
fabienluce at gmail.com
Tue May 29 05:19:52 EDT 2018
May 29 2018 11:12 AM, "Thomas Jollans" <tjol at tjol.eu> wrote:
> On 2018-05-29 09:55, ftg at lutix.org wrote:
>
>> Hello,
>> Using Python 2.7 (I will switch to Py3 soon, but first I'd like to understand how string encoding
>> works)
>
> Oh dear. This is probably the exact wrong way to go about it: the
> interplay between string encoding, unicode and bytes is much less clear
> and easy to understand in Python 2.
Ok I will quickly jump into py3 then.
>
>> Could you please tell me if I have understood correctly what goes on in Python's mind:
>> in a .py file:
>> if I write s="héhéhé" and my file is declared with a Unicode coding, Python will store in memory
>> s='hx82hx82hx82'
>
> No, it doesn't. At the very least, you're missing some backslashes, and
> I don't know of any character encoding that uses 0x82 to encode é.
>
Surprisingly, the backslashes were removed from my initial text...
OK, so the stored raw bytes are the ones produced by the system encoding. If my console were UTF-8, I would get the same raw byte string as you.
> On my system, I see
>
> >>> s = 'héhéhé'
> >>> s
> 'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
>
> My system uses UTF-8. If your PC is set up to use an encoding like ISO
> 8859-15 or Windows-1252, you should see
>
> 'h\xe9h\xe9h\xe9'
>
> The \x?? are just Python notation.
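For reference, a minimal Python 3 sketch of the point above: the same text becomes different bytes depending on the encoding used. The text is written with escapes so that the encoding of the source file does not matter. The choice of cp850 is only an assumption about what a Western European Windows console might use; it is not something stated in this thread.

```python
# Python 3 sketch: one text, several encodings, different bytes.
text = 'h\u00e9h\u00e9h\u00e9'  # 'héhéhé', spelled with escapes on purpose

# cp850 is an assumed example of a console code page, not taken from the thread.
encodings = ['utf-8', 'iso-8859-15', 'cp1252', 'cp850']
encoded = {name: text.encode(name) for name in encodings}

for name in encodings:
    # bytes objects print in the same \x?? notation discussed above
    print(name, encoded[name])
```

Under cp850, é is encoded as the single byte 0x82, which may be where the 'x82' in the original question came from.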
>
>> however this is not yet unicode for python interpreter this is just raw bytes. Right?
>
> Right, this is a bunch of bytes:
>
> >>> s
> 'h\xe9h\xe9h\xe9'
> >>> [ord(c) for c in s]
> [104, 233, 104, 233, 104, 233]
> >>> [hex(ord(c)) for c in s]
> ['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
>>
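A rough Python 3 equivalent of the session above, as a sketch: there, a bytes object already behaves as a sequence of integers, so ord() is not needed. The variable names are only illustrative.

```python
# Python 3 sketch: iterating over bytes yields integers directly.
raw = b'h\xe9h\xe9h\xe9'  # the Latin-1 / Windows-1252 bytes from the session above

values = list(raw)                  # one integer per byte
hex_values = [hex(b) for b in raw]  # the same integers, in hex notation

print(values)
print(hex_values)
```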
>> By the way, why is 'h' not shown as a hex value? Because it is already in the ASCII table?
>
> That's just how Python 2 likes to display stuff: printable ASCII bytes
> are shown as characters, everything else as \x?? escapes.
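A small sketch of that display rule, using Python 3's bytes type (which follows the same convention as Python 2's str here): the byte 0x68 is shown as 'h' because it is printable ASCII, while 0xe9 stays as an escape. Both spellings denote exactly the same bytes.

```python
# Python 3 sketch: repr() shows printable ASCII bytes as characters, others as \x escapes.
raw = bytes([0x68, 0xe9])  # two bytes given by their numeric values

shown = repr(raw)                 # how the interpreter would display them
same = b'\x68\xe9' == b'h\xe9'    # writing 'h' as \x68 changes nothing

print(shown)
print(same)
```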
>
>> If I want the Python interpreter to treat my string as Unicode, I have to declare it as Unicode,
>> s=u'héhéhé', and magically Python will look up those
>> hex values 'x82' in the Unicode table. Still OK?
>
> In principle, the unicode table has nothing to do with anything here. It
> so happens that for some characters in some encodings the value is equal
> to the code point, but that's neither here nor there.
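To illustrate the difference between a code point and its encoded bytes, here is a minimal Python 3 sketch. The euro sign is an extra example chosen for illustration; it does not appear in this thread.

```python
# Python 3 sketch: a code point is a number; the bytes depend on the encoding.
e_acute = '\u00e9'  # é
euro = '\u20ac'     # euro sign, added only as a contrasting example

code_points = (ord(e_acute), ord(euro))

# For é, the Latin-1 byte happens to equal the code point; the UTF-8 bytes do not.
e_latin1 = e_acute.encode('latin-1')
e_utf8 = e_acute.encode('utf-8')

# For the euro sign, neither encoding matches the code point 0x20AC.
euro_cp1252 = euro.encode('cp1252')
euro_utf8 = euro.encode('utf-8')

print([hex(cp) for cp in code_points])
print(e_latin1, e_utf8, euro_cp1252, euro_utf8)
```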
>
>> Now: how come when I declare s='héhéhé', print(s) correctly displays 'héhéhé'? Is it because my shell
>> window handles Unicode well? Or is it
>> because the print function is magic?
>
> It's because the print statement is magic.
>
> Actually, this *only* works if the encoding of your file matches the
> encoding expected by your console. This is usually the case as
> long as you stay on the same PC, but this assumption can fall apart
> quite easily when you move code and data between systems, especially if
> they use different operating systems or (human) languages.
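A sketch of what such a mismatch looks like, in Python 3 with explicit encode and decode steps. The pairing of UTF-8 with Windows-1252 is just one assumed combination; other pairs fail in other ways.

```python
# Python 3 sketch: bytes written with one encoding, read back with another.
text = 'h\u00e9h\u00e9h\u00e9'  # 'héhéhé'

# UTF-8 bytes interpreted as Windows-1252: no error, but the text is garbled.
utf8_bytes = text.encode('utf-8')
garbled = utf8_bytes.decode('cp1252')

# Latin-1 bytes interpreted as UTF-8: not valid UTF-8, so decoding fails.
latin1_bytes = text.encode('latin-1')
try:
    latin1_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True

print(garbled)
print(failed)
```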
>
> Just use Python 3. There, the print function is not magic, which makes
> life so much more logical.
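As a rough illustration of why this is more predictable in Python 3: print() takes text and leaves the conversion to bytes to the output stream, which carries an explicit encoding. The helper printed_bytes below is hypothetical and exists only to capture what would be written; it is not part of any library.

```python
# Python 3 sketch: print() writes text; the stream's encoding decides the bytes.
import io


def printed_bytes(text, encoding):
    """Hypothetical helper: return the bytes print() writes to a stream with this encoding."""
    buffer = io.BytesIO()
    # newline='\n' keeps the line ending identical on every platform
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline='\n')
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()


print(printed_bytes('h\u00e9', 'utf-8'))
print(printed_bytes('h\u00e9', 'latin-1'))
```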
Thanks
>
> -- Thomas