Unicode literals and byte string interpretation.

Chris Angelico rosuav at gmail.com
Thu Oct 27 23:38:39 EDT 2011


On Fri, Oct 28, 2011 at 2:05 PM, Fletcher Johnson <flt.johnson at gmail.com> wrote:
> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string? Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?
>
> For reference the string is これは in the 'shift-jis' encoding.

Encodings define how characters are represented in bytes. What you
probably want is a byte string with those hex values in it, which you
can then decode into a Unicode string:

>>> a=b'\x82\xb1\x82\xea\x82\xcd'
>>> unicode(a,"shift-jis")    # use 'str' instead of 'unicode' in Python 3
u'\u3053\u308c\u306f'
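
For Python 3, the same conversion might look like the sketch below. The
names raw and text are only illustrative, and 'shift_jis' is used as the
codec name on the assumption that it selects the same codec as
"shift-jis" above.

```python
# Python 3: a bytes literal holding the Shift-JIS encoding of the text.
raw = b'\x82\xb1\x82\xea\x82\xcd'

# Decode the bytes into a str (Unicode) by naming the encoding explicitly.
text = raw.decode('shift_jis')

print(ascii(text))           # '\u3053\u308c\u306f'
print(len(raw), len(text))   # 6 3  (six bytes, three characters)

# Encoding with the same codec gives back the original bytes.
round_trip = text.encode('shift_jis')
print(round_trip == raw)     # True
```

The encoding has to be named at the point of decoding; nothing in the
bytes themselves says which encoding they are in.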

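The u'....' notation is for Unicode strings, which are sequences of
code points rather than encoded bytes. Inside such a literal, \x82
means the code point U+0082, not the byte 0x82, so no encoding is
assumed at all. The output line above, u'\u3053\u308c\u306f', is a
valid way of entering that string in your source code, identifying the
Unicode characters by their code points.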
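
If you already have a Unicode string that was built from byte values by
mistake, one possible way to recover the text is to go through Latin-1
first. This is only a sketch in Python 3 syntax; the names wrong, raw
and fixed are made up for the example.

```python
# Python 3, where every '...' literal is a Unicode string.
# Each \xNN escape here is a code point (U+0082, U+00B1, ...), not a byte.
wrong = '\x82\xb1\x82\xea\x82\xcd'
code_points = [hex(ord(ch)) for ch in wrong]
print(code_points)   # ['0x82', '0xb1', '0x82', '0xea', '0x82', '0xcd']

# Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# so encoding with it yields the byte string that was originally intended.
raw = wrong.encode('latin-1')

# Those bytes can now be decoded with the encoding they were really in.
fixed = raw.decode('shift_jis')
print(fixed == '\u3053\u308c\u306f')   # True
```

This only works because every character in the mistaken string is below
U+0100; it is a repair for this specific mix-up, not a general technique.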

ChrisA


