[I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model)

M.-A. Lemburg mal@lemburg.com
Sat, 10 Feb 2001 23:16:02 +0100

Paul Prescod wrote:
> I've thought about this coercion issue more...I think we need to
> auto-coerce these binary strings using some well-defined rule (NOT a
> default encoding!).
> "M.-A. Lemburg" wrote:
> >
> > > ...
> > >
> > > I would want to avoid the need for a 2.0-style 'default encoding', so I
> > > suggest it shouldn't be possible to mix this type with other strings:
> > >
> > > >>> "1"+b"2"
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in ?
> > > TypeError: cannot add type "binary" to string
> > > >>> "3"==b"3"
> > > 0
> >
> > Right. This will cause people to rethink whether they are
> > using the object for text data or binary data. I still think that
> > at the interface level, b"" and "" should be treated the same (except
> > that b""-strings should not implement the char buffer interface).
>
> If C functions auto-convert these things then people will coerce them by
> passing them through C functions. e.g. the regular expression engine or
> null encoding functions or whatever.
>
> If we do NOT auto-coerce these things then they will not be compatible
> with many parts of the Python infrastructure, the regular expression
> engine and codecs being the most important examples. A clear requirement
> from Andy Robinson was that string-like code should work on binary data
> because often binary strings are "really" un-decoded strings. I think he
> is speaking on behalf of a lot of serious internationalizers there.

b""-strings will expose all the APIs needed to be compatible with
the re engine, with codecs, and with most other C-level interfaces
that use the s or s# parser markers.
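
As a rough illustration of what this compatibility could look like, the
sketch below uses the bytes type of present-day Python 3 as a stand-in
for the proposed b""-strings. That substitution is an assumption made
for the example only; the sample data is made up as well.

```python
import codecs
import re

# Stand-in for a proposed b""-string: Python 3's bytes type (an assumption).
data = b"\x89PNG\r\n\x1a\n" + b"IHDR" + b"\x00\x00\x00\x0d"

# The re engine works directly on binary data when the pattern is binary too.
match = re.search(rb"IHDR", data)
offset = match.start()

# Codecs accept binary input as well; turning it into text is an explicit step.
text = codecs.decode(b"caf\xc3\xa9", "utf-8")
hexed = codecs.encode(b"\x00\xff", "hex_codec")
```

Nothing here coerces the binary data to text behind the caller's back:
the pattern, the input and the hex output all stay binary, and only the
explicit decode call produces a text string.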

In reality, the only breakage will be in code which explicitly
requests a string object, and those instances should really be
modified to use the above parser markers instead.
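
The difference between the two kinds of code can be sketched at the
Python level. Both helpers below are hypothetical, and memoryview is
only a loose analogue of what the s# marker does in C: the first
insists on a particular object type and so rejects binary data, the
second asks only for a readable buffer.

```python
def checksum_strict(data):
    # Hypothetical helper: explicitly requests a text string object,
    # so binary data is rejected even though it could be processed.
    if not isinstance(data, str):
        raise TypeError("expected a string object")
    return sum(data.encode("latin-1")) % 256


def checksum_buffer(data):
    # Hypothetical helper: asks only for a readable buffer, a loose
    # Python-level analogue of the s# parser marker.
    view = memoryview(data)
    return sum(view.tobytes()) % 256
```

With these definitions, checksum_buffer accepts bytes and bytearray
objects alike, while checksum_strict raises TypeError for both. Code of
the first kind is what would need to be changed.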

Given these semantics, auto-conversion is not really necessary
for b""-strings.

Note that I see b""-strings as a replacement for our current 8-bit
strings in the context of handling non-text data. 8-bit strings
should still remain intact and available (even after making
"" produce Unicode strings), but should be extended to carry
additional encoding information (see the small image I posted on
the "Storing encoding information" thread).

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/