"convert" string to bytes without changing data (encoding)

Ethan Furman ethan at stoneleaf.us
Wed Mar 28 14:17:56 EDT 2012


Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
> 
> The longer story of my question is: I am new to Python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for Python 3.x. The biggest problem I am facing is that I
> am often dealing with data that is basically text but can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know; for my purposes, it would be
> perfectly good enough to deal with the ASCII portions and keep anything
> else unchanged.

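Where is the data coming from?  Files?  In that case, it sounds like you
will want to decode/encode using 'latin-1', as the bulk of your text is
plain ASCII and you don't really care about the bytes above 0x7F.
latin-1 maps every byte value 0x00-0xFF to the code point with the same
number, so decoding never fails and encoding the result again gives you
back exactly the bytes you started with.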
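
A minimal sketch of that round trip, in case it helps.  The sample
bytes and variable names below are made up for illustration; only
bytes.decode() and str.encode() with 'latin-1' are the point:

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# Decoding as latin-1 cannot fail: byte N becomes code point N.
text = raw.decode('latin-1')

# Encoding the result gives back exactly the original bytes.
assert text.encode('latin-1') == raw

# The ASCII portions can be handled as ordinary text while the
# remaining bytes pass through untouched.
data = b'name=caf\xe9\n'    # hypothetical input containing one 8-bit byte
line = data.decode('latin-1').replace('name', 'NAME')
result = line.encode('latin-1')
```

For files, passing encoding='latin-1' to open() should behave the same
way; add newline='' as well if line endings must also stay untouched.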

~Ethan~


