"convert" string to bytes without changing data (encoding)

Peter Daum gator at cs.tu-berlin.de
Wed Mar 28 19:43:36 CEST 2012


On 2012-03-28 12:42, Heiko Wundram wrote:
> Am 28.03.2012 11:43, schrieb Peter Daum:
>> ... in my example, the variable s points to a "string", i.e. a series of
>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
> 
> No; a string contains a series of codepoints from the unicode plane,
> representing natural language characters (at least in the simplistic
> view, I'm not talking about surrogates). These can be encoded to
> different binary storage representations, of which ascii is (a common) one.
> 
>> What I am looking for is a general way to just copy the raw data
>> from a "string" object to a "byte" object without any attempt to
>> "decode" or "encode" anything ...
> 
> There is "logically" no raw data in the string, just a series of
> codepoints, as stated above. You'll have to specify the encoding to use
> to get at "raw" data, and from what I gather you're interested in the
> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
> speak).

... I was under the illusion, that python (like e.g. perl) stored
strings internally in utf-8. In this case the "conversion" would simple
mean to re-label the data. Unfortunately, as I meanwhile found out, this
is not the case (nor the "apple encoding" ;-), so it would indeed be
pretty useless.

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisory
to go for python 3.x. The biggest problem that I am facing is, that I
am often dealing with data, that is basically text, but it can contain
8-bit bytes. In this case, I can not safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

As it seems, this would be far easier with python 2.x. With python 3
and its strict distinction between "str" and "bytes", things gets
syntactically pretty awkward and error-prone (something as innocently
looking like "s=s+'/'" hidden in a rarely reached branch and a
seemingly correct program will crash with a TypeError 2 years
later ...)

Regards,
                         Peter



More information about the Python-list mailing list