<div dir="ltr"><div style>This has all gotten a bit complicated because everyone has been thinking in terms of actual encodings and actual text files. But I think the use-case here is something different:</div><div style><br>
</div><div style>A file with a bunch of bytes in it, _some_ of which are ascii, and the rest are other bytes (maybe binary data, maybe non-ascii-encoded text).</div><div style><br></div><div style>I think this is the use-case that "just worked" in py2, but doesn't in py3 -- i.e. in py3 you have to choose either the binary interpretation or the ascii one, but you can't have both. If you choose ascii, decoding will barf on the first non-ascii byte; if you choose binary, you lose the ability to do simple stuff with the ascii subset -- parsing, substitution, etc.</div>
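<div style><br></div><div style>A minimal sketch of that dilemma, assuming a made-up payload (the <code>data</code> bytes and the <code>ver=</code> field are purely illustrative, not from any real format):</div>

```python
# Hypothetical mixed payload: an ASCII field followed by arbitrary bytes.
data = b"ver=1\n" + bytes([0x00, 0xFF, 0x80, 0x41])

# Choosing the "ascii" interpretation fails on the first non-ASCII byte.
try:
    data.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Choosing the "binary" interpretation keeps every byte, and the simple
# bytes methods still work on the ASCII subset ...
patched = data.replace(b"ver=1", b"ver=2")

# ... but every pattern must itself be bytes; passing str is a TypeError.
try:
    data.replace("ver=1", "ver=2")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(decoded_ok, mixed_ok, patched[:5])
```

<div style>So the bytes route is not empty-handed, but anything that expects a str (and, at the time of this thread, string formatting) is out of reach without picking an encoding first.</div>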
<div style><br></div><div style>Some folks have suggested using latin-1 (or other 8-bit encoding) -- is that guaranteed to work with any binary data, and round-trip accurately?</div><div style><br></div><div style>and will surrogateescape work for arbitrary binary data?</div>
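<div style><br></div><div style>As far as I can tell, both questions can be checked directly. The sketch below round-trips all 256 byte values through latin-1 and through ascii with the surrogateescape error handler; the variable names and the <code>name=old</code> payload are made up for illustration:</div>

```python
# Every possible byte value, so nothing is left untested.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so any byte string decodes
# and encodes back to the identical bytes.
as_latin1 = all_bytes.decode("latin-1")
latin1_roundtrip = as_latin1.encode("latin-1") == all_bytes

# ascii + surrogateescape: bytes 0x80-0xFF become lone surrogates
# U+DC80-U+DCFF and are restored on encode with the same handler.
as_escaped = all_bytes.decode("ascii", errors="surrogateescape")
escaped_roundtrip = (
    as_escaped.encode("ascii", errors="surrogateescape") == all_bytes
)

# In between, the ASCII subset can be edited with ordinary str methods.
mixed = b"name=old\x00\xff\xfe"
text = mixed.decode("ascii", errors="surrogateescape")
edited = text.replace("old", "new").encode("ascii", errors="surrogateescape")

print(latin1_roundtrip, escaped_roundtrip, edited)
```

<div style>Two caveats: with latin-1 the non-ascii bytes look like real characters (so things like case conversion can silently change them), and with surrogateescape the resulting str contains lone surrogates, so it can't be encoded with a strict codec such as utf-8 without the same error handler.</div>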
<div style><br></div><div style>If this is a common need, then it would be nice for py3 to address it. I know that I work with a couple of file formats that have text headers followed by binary data (not as hard to deal with, but still harder in py3). And from this discussion, it seems that "wire protocols" commonly mix ascii and binary.</div>
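<div style><br></div><div style>For the "text header followed by binary data" case, one way to cope in py3 is to split on the header terminator while everything is still bytes, and decode only the header. This is a sketch under assumptions: the blank-line terminator, the <code>key: value</code> header layout, the <code>count</code> field and the <code>split_header</code> helper are all hypothetical, not any particular format I use:</div>

```python
import struct


def split_header(blob, terminator=b"\n\n"):
    """Split a blob into (header fields, raw binary payload).

    Hypothetical layout: ASCII "key: value" lines, a blank line,
    then arbitrary binary data.
    """
    header_bytes, sep, payload = blob.partition(terminator)
    if not sep:
        raise ValueError("no header terminator found")
    fields = {}
    # Only the header is decoded; the payload stays as bytes.
    for line in header_bytes.decode("ascii").splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, payload


# Build a sample blob: two header lines plus three unsigned 16-bit ints.
blob = b"format: demo\ncount: 3\n\n" + struct.pack("!3H", 1, 2, 65535)

fields, payload = split_header(blob)
values = struct.unpack("!%dH" % int(fields["count"]), payload)
print(fields, values)
```

<div style>This works because the boundary is known; it's the interleaved case (ascii and binary mixed throughout) where there's no clean place to split.</div>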
<div style><br></div><div style>So the decisions to be made:</div><div style><br></div><div style>Is this a use-case worth supporting in the standard library?</div><div style><br></div><div style>If so, how?</div><div style>
1) add some of the basic stuff to the bytes object - i.e. string formatting, what this all started with.</div><div style> 2) create a custom encoding that could losslessly convert this mixture to/from a unicode object. I'm not sure if that is even possible, but it would be kind of cool.</div><div style> 3) create a new object, neither a string nor a bytes object, that did what we want (it would look a lot like the py2 string...)</div>
<div style> 4) create a module for doing the stuff wanted with a bytes object (not very OO)</div><div style><br></div><div style>Does that clarify the discussion at all?</div><div style><br></div>On Thu, Jan 9, 2014 at 2:15 AM, Kristján Valur Jónsson <span dir="ltr">&lt;<a href="mailto:kristjan@ccpgames.com" target="_blank">kristjan@ccpgames.com</a>&gt;</span> wrote:<br>
<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="im">
This is the python 2 program:<br></div>
with open(fn1) as f1:<br>
&nbsp;&nbsp;&nbsp;&nbsp;with open(fn2, 'w') as f2:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f2.write(process_text(f1.read()))<br></blockquote><div><br></div><div style>I think the key point here is that this worked because a common case was ascii text and arbitrary binary data mixed together. As long as all the process_text() stuff is ascii-only, that would work, with either arbitrary binary data or an ascii-compatible encoding. The fact that it would NOT work with arbitrarily encoded data doesn't mean it's not useful for this special, but perhaps common, case.</div>
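<div style><br></div><div style>For comparison, here is roughly what I'd expect the py3 version of that program to look like if surrogateescape does the job. It's a sketch, not a tested recipe: <code>copy_processed</code> is a name I made up, and <code>process_text</code> is assumed to touch only the ascii parts of the str it is given:</div>

```python
def copy_processed(fn1, fn2, process_text):
    """Copy fn1 to fn2, passing the content through process_text.

    Non-ASCII bytes are smuggled through as lone surrogates and written
    back unchanged, provided process_text leaves them alone.
    """
    opts = dict(
        encoding="ascii",
        errors="surrogateescape",
        newline="",  # disable newline translation so bytes are preserved
    )
    with open(fn1, **opts) as f1:
        with open(fn2, "w", **opts) as f2:
            f2.write(process_text(f1.read()))
```

<div style>Swapping in encoding="latin-1" (and dropping the errors argument) should behave the same for ascii-only processing; the newline="" part matters either way, otherwise text mode may rewrite line endings inside the binary data.</div>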
<div style><br></div></div>-- <br><br>Christopher Barker, Ph.D.<br>Oceanographer<br><br>Emergency Response Division<br>NOAA/NOS/OR&R (206) 526-6959 voice<br>7600 Sand Point Way NE (206) 526-6329 fax<br>
Seattle, WA 98115 (206) 526-6317 main reception<br><br><a href="mailto:Chris.Barker@noaa.gov" target="_blank">Chris.Barker@noaa.gov</a>
</div></div>