String encoding in Py2.7

Tue May 29 05:19:52 EDT 2018

May 29 2018 11:12 AM, "Thomas Jollans" <tjol at tjol.eu> wrote:
> On 2018-05-29 09:55, ftg at lutix.org wrote:
> 
>> Hello,
>> Using Python 2.7 (I will switch to Py3 soon, but before that I'd like to understand how string
>> encoding works)
> 
> Oh dear. This is probably the exact wrong way to go about it: the
> interplay between string encoding, unicode and bytes is much less clear
> and easy to understand in Python 2.

OK, I will quickly jump to Py3 then.

> 
>> Could you please tell me if I understood correctly what happens in Python's mind:
>> in a .py file:
>> if I write s="héhéhé", if my file is declared as unicode coding, python will store in memory
>> s='hx82hx82hx82'
> 
> No, it doesn't. At the very least, you're missing some backslashes – and
> I don't know of any character encoding that uses 0x82 to encode é.
> 
Surprisingly, the backslashes were removed from my initial text...
OK, so the raw bytes stored are the ones produced by the system encoding. If my console used UTF-8, I would get the same raw byte string as you.

> On my system, I see
> 
> >>> s = 'héhéhé'
> >>> s
> 'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
> 
> My system uses UTF-8. If your PC is set up to use an encoding like ISO
> 8859-15 or Windows-1252, you should see
> 
> 'h\xe9h\xe9h\xe9'
> 
> The \x?? are just Python notation.
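
For reference, a minimal Python 3 sketch of the same point: one piece of text, several possible byte sequences. The variable names are only illustrative. The cp437 line is an assumption about the setup, since the thread never says which console encoding was in use; it is included because the legacy DOS code pages cp437 and cp850 do map é to 0x82, which would match the value in the original question.

```python
# Python 3 sketch. The text is written with an escape so the example does not
# depend on the encoding of the source file itself.
text = "h\u00e9h\u00e9h\u00e9"  # U+00E9 is LATIN SMALL LETTER E WITH ACUTE

# Same text, three different byte sequences.
utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso-8859-15")
cp437_bytes = text.encode("cp437")  # assumed console code page, not confirmed

print(utf8_bytes)    # b'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
print(latin9_bytes)  # b'h\xe9h\xe9h\xe9'
print(cp437_bytes)   # b'h\x82h\x82h\x82'
```

In Python 2 a plain 'héhéhé' literal simply holds whichever of these byte sequences the file or terminal produced.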
> 
>> However, this is not yet Unicode for the Python interpreter; it is just raw bytes. Right?
> 
> Right, this is a bunch of bytes:
> 
> >>> s
> 'h\xe9h\xe9h\xe9'
> >>> [ord(c) for c in s]
> [104, 233, 104, 233, 104, 233]
> >>> [hex(ord(c)) for c in s]
> ['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
>> 
>> By the way, why is 'h' not turned into a hex value? Because it is already in the ASCII table?
> 
> That's just how Python 2 likes to display stuff.
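
A small Python 3 sketch of the display rule, assuming the Latin-1 style bytes from the session above: the repr of a byte string shows printable ASCII bytes as characters and escapes everything else as \x??, so 'h' and '\x68' are the same byte written two ways.

```python
data = b"h\xe9h\xe9h\xe9"  # the Latin-1 style bytes from the session above

byte_values = list(data)             # iterating bytes yields ints in Python 3
hex_values = [hex(b) for b in data]  # every byte has a hex value, 'h' included
shown = repr(data)                   # printable ASCII is shown as a character

print(byte_values)  # [104, 233, 104, 233, 104, 233]
print(hex_values)   # ['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
print(shown)        # b'h\xe9h\xe9h\xe9'
print(b"\x68" == b"h")  # True
```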
> 
>> If I want the Python interpreter to recognize my string as Unicode, I have to declare it as unicode,
>> s=u'héhéhé', and magically Python will look up those
>> hex values 'x82' in the Unicode table. Still OK?
> 
> In principle, the unicode table has nothing to do with anything here. It
> so happens that for some characters in some encodings the value is equal
> to the code point, but that's neither here nor there.
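
To illustrate the difference between a code point and an encoded value, here is a hedged Python 3 sketch. The euro sign is my own choice of second example and does not come from the thread; it is handy because its encoded byte differs between Windows-1252 and ISO 8859-15, and neither byte equals its code point.

```python
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
euro = "\u20ac"     # EURO SIGN (illustrative choice, not from the thread)

# For e-acute, the Latin-1 byte happens to equal the code point; UTF-8 does not.
code_point = ord(e_acute)                 # 233, i.e. 0xE9
latin1_byte = e_acute.encode("latin-1")   # b'\xe9'
utf8_bytes = e_acute.encode("utf-8")      # b'\xc3\xa9'

# For the euro sign, no single-byte encoding can equal the code point 0x20AC.
euro_code_point = ord(euro)                  # 8364, i.e. 0x20AC
euro_cp1252 = euro.encode("cp1252")          # b'\x80'
euro_latin9 = euro.encode("iso-8859-15")     # b'\xa4'
euro_utf8 = euro.encode("utf-8")             # b'\xe2\x82\xac'

print(hex(code_point), latin1_byte, utf8_bytes)
print(hex(euro_code_point), euro_cp1252, euro_latin9, euro_utf8)
```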
> 
>> Now: how come when I declare s='héhéhé', print(s) correctly displays 'héhéhé'? Is it because my shell
>> window handles Unicode well? Or is it
>> because the print function is magic?
> 
> It's because the print statement is magic.
> 
> Actually, this *only* works if the encoding of your file matches the
> default encoding required by your console. This is usually the case as
> long as you stay on the same PC, but this assumption can fall apart
> quite easily when you move code and data between systems, especially if
> they use different operating systems or (human) languages.
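
A minimal Python 3 sketch of how that assumption falls apart, under the assumption that the file was saved as UTF-8 and the reading side expects Windows-1252 (or the reverse). The encodings are picked for illustration only; ascii() is used for printing so the output does not itself depend on the console encoding.

```python
text = "h\u00e9h\u00e9h\u00e9"

# Bytes written on a UTF-8 system, then interpreted as Windows-1252:
file_bytes = text.encode("utf-8")
garbled = file_bytes.decode("cp1252")
print(ascii(garbled))  # 'h\xc3\xa9h\xc3\xa9h\xc3\xa9', i.e. two wrong characters per e-acute

# Bytes written on a Latin-1 system, then interpreted as UTF-8:
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # 0xE9 announces a multi-byte UTF-8 sequence, but 'h' is not a continuation byte
    decode_failed = True
print(decode_failed)  # True
```

The first direction silently produces wrong characters; the second fails loudly. Python 2 byte strings hide both cases until the bytes reach a console with a different encoding.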
> 
> Just use Python 3. There, the print function is not magic, which makes
> life so much more logical.

Thanks
> 
> -- Thomas