I'm working on converting Objects/bytearray.c and Objects/bytes.c.
For bytes, the strip methods need a "self converter" so that they get
a PyBytesObject* instead of PyObject*. However, having set this in
bytes.strip and "cloning" that clinic definition for bytes.lstrip and
bytes.rstrip, it appears that the self converter wasn't set on lstrip
and rstrip. Removing the cloning and copying the argument definitions
resolved the issue.
Is this a bug?
Here's the text for your reading pleasure. I'll commit the PEP after I add some markup.
- dropped `format` support, just using %-interpolation
- Rationale section ;)
Title: Adding % formatting to bytes
Author: Ethan Furman <ethan(a)stoneleaf.us>
Type: Standards Track
Post-History: 2014-01-14, 2014-01-15, 2014-01-17
This PEP proposes adding % formatting operations similar to Python 2's str type
to bytes [1]_ [2]_.
In order to avoid the problems of auto-conversion and Unicode exceptions that
could plague Py2 code, all object checking will be done by duck-typing, not by
values contained in a Unicode representation [3]_.
Proposed semantics for bytes formatting
All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
will be supported, and will work as they do for str, including the
padding, justification and other related modifiers.
>>> b'%4x' % 10
>>> b'%#4x' % 10
>>> b'%04X' % 10
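
The padding and prefix modifiers above can be tried directly on an interpreter where bytes supports the % operator (Python 3.5 or later; that version requirement is an assumption about the reader's environment, not something stated in this draft). A minimal sketch, with illustrative variable names:

```python
# Assumes Python 3.5 or later, where bytes supports the % operator.
padded = b'%4x' % 10        # width 4, padded with spaces on the left
prefixed = b'%#4x' % 10     # '#' adds the 0x prefix inside the width
zero_upper = b'%04X' % 10   # zero padded, upper-case hex digits

print(padded, prefixed, zero_upper)
```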
%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1, not from a str.
>>> b'%c' % 48
>>> b'%c' % b'a'
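
As a rough check of the %c rule, the sketch below exercises the two accepted argument kinds and the rejected one. It assumes Python 3.5 or later; the variable names are only for illustration.

```python
# Assumes Python 3.5 or later; names here are illustrative only.
from_int = b'%c' % 48        # 48 is the ASCII code for the digit 0
from_bytes = b'%c' % b'a'    # a bytes object of length 1

try:
    b'%c' % 'a'              # a str argument is not accepted
    str_rejected = False
except TypeError:
    str_rejected = True

print(from_int, from_bytes, str_rejected)
```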
%s is restricted in what it will accept::
- input type supports Py_buffer?
use it to collect the necessary bytes
- input type is something else?
use its __bytes__ method; if there isn't one, raise a TypeError
>>> b'%s' % b'abc'
>>> b'%s' % 3.14
Traceback (most recent call last):
TypeError: 3.14 has no __bytes__ method
>>> b'%s' % 'hello world!'
Traceback (most recent call last):
TypeError: 'hello world!' has no __bytes__ method, perhaps you need to encode it?
Because the str type does not have a __bytes__ method, attempts to
directly use 'a string' as a bytes interpolation value will raise an
exception. To use 'string' values, they must be encoded or otherwise
transformed into a bytes sequence::
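
One way to see the %s rules in action is sketched below. It assumes Python 3.5 or later; `Packet` is a hypothetical class invented for this example, and latin-1 is just one possible choice of encoding.

```python
class Packet:
    """Hypothetical type that knows its own byte representation."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return b'<' + self.payload + b'>'


wrapped = b'%s' % Packet(b'abc')            # goes through Packet.__bytes__
from_buffer = b'%s' % bytearray(b'abc')     # goes through the buffer protocol
encoded = b'%s' % 'a string'.encode('latin-1')  # str must be encoded first

try:
    b'%s' % 'a string'                      # a bare str is rejected
    str_rejected = False
except TypeError:
    str_rejected = True

print(wrapped, from_buffer, encoded, str_rejected)
```

The exact wording of the TypeError may differ from the messages shown in this draft, so the sketch only checks the exception type.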
Numeric Format Codes
To properly handle int and float subclasses, int(), index(), and float()
will be called on the objects intended for (d, i, u), (b, o, x, X), and
(e, E, f, F, g, G).
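
The mapping from format code to conversion can be sketched in pure Python. This is only an illustration of the rule as worded above, not the C implementation; `coerce_for_code`, `Port` and `Celsius` are hypothetical names.

```python
import operator

INT_CODES = frozenset('diu')       # converted with int()
INDEX_CODES = frozenset('boxX')    # converted with index()
FLOAT_CODES = frozenset('eEfFgG')  # converted with float()


def coerce_for_code(code, value):
    """Hypothetical helper: apply the conversion the text names for a code."""
    if code in INT_CODES:
        return int(value)
    if code in INDEX_CODES:
        return operator.index(value)
    if code in FLOAT_CODES:
        return float(value)
    raise ValueError('not a numeric format code: %r' % (code,))


class Port(int):
    """Illustrative int subclass."""


class Celsius(float):
    """Illustrative float subclass."""


print(coerce_for_code('d', Port(8080)))
print(coerce_for_code('x', Port(255)))
print(coerce_for_code('f', Celsius(21.5)))
```

Because `operator.index()` refuses floats, a call such as `coerce_for_code('x', 2.5)` raises TypeError rather than silently truncating.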
%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
supported.
It was suggested to let %s accept numbers, but since numbers have their own
format codes this idea was discarded.
It has been suggested to use %b for bytes instead of %s.
- Rejected as %b does not exist in Python 2.x %-interpolation, which is
why we are using %s.
It has been proposed to automatically use .encode('ascii','strict') for str
arguments to %s.
- Rejected as this would lead to intermittent failures. Better to have the
operation always fail so the trouble-spot can be correctly fixed.
It has been proposed to have %s return the ascii-encoded repr when the value
is a str (b'%s' % 'abc' --> b"'abc'").
- Rejected as this would lead to hard to debug failures far from the problem
site. Better to have the operation always fail so the trouble-spot can be
Originally this PEP also proposed adding format style formatting, but it was
decided that format and its related machinery were all strictly text (aka str)
based, and it was dropped.
Various new special methods were proposed, such as __ascii__, __format_bytes__,
etc.; such methods are not needed at this time, but can be visited again later
if real-world use shows deficiencies with this solution.
.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither string.Template, format, nor str.format are under consideration.
.. [3] %c is not an exception as neither of its possible arguments are unicode.
This document has been placed in the public domain.
As I see it, there are two separate goals in adding formatting
methods to bytes. One is to make it easier to write new programs
that manipulate byte data. Another is to make it easier to upgrade
Python 2.x programs to Python 3.x. Here is an idea to better
address these separate goals.
Introduce %-interpolation for bytes. Support the following format
codes to aid in writing new code:
%b: insert arbitrary bytes (via __bytes__ or Py_buffer)
%[dox]: insert an integer, encoded as ASCII
%[eEfFgG]: insert a float, encoded as ASCII
%a: call ascii(), insert result
Add a command-line option, disabled by default, that enables the
following format codes:
%s: if the object has __bytes__ or Py_buffer then insert it.
Otherwise, call str() and encode with the 'ascii' codec
%r: call repr(), encode with the 'ascii' codec
%[iuX]: as per Python 2.x, for backwards compatibility
Introducing these extra codes and the command-line option will
provide a more gradual upgrade path. The next step in porting could
be to examine each %s inside bytes literals and decide if they
should either be converted to %b or if the literal should be
converted to a unicode literal. Any %r codes could likely be safely
changed to %a.
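
To make the proposed compatibility codes concrete, here is a pure-Python sketch of what %s and %r would do under this idea. The helper names are hypothetical, and using `memoryview()` to test for buffer support is an assumption of the sketch, not part of the proposal.

```python
def legacy_percent_s(value):
    """Hypothetical emulation of the proposed opt-in %s behaviour."""
    if hasattr(type(value), '__bytes__'):
        return bytes(value)                 # object supplies __bytes__
    try:
        return bytes(memoryview(value))     # object supports the buffer protocol
    except TypeError:
        return str(value).encode('ascii')   # fall back to str() plus ASCII


def legacy_percent_r(value):
    """Hypothetical emulation of the proposed opt-in %r behaviour."""
    return repr(value).encode('ascii')


print(legacy_percent_s(b'raw'))
print(legacy_percent_s(42))
print(legacy_percent_s('text'))
print(legacy_percent_r('text'))
```

Note that the ASCII fallback can still raise UnicodeEncodeError for non-ASCII text, which is the kind of data-dependent failure the PEP's rejection notes worry about.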
I hope you didn't mean to take this off-list:
On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer <nas(a)arctrix.com> wrote:
> In gmane.comp.python.devel, you wrote:
> > For the record, we've got a pretty good thread (not this good, though!)
> > over on the numpy list about how to untangle the mess that has resulted
> Not sure about your definition of good. ;-)
well, in the sense of "big" anyway...
> Could you summarize the main points on python-dev? I'm not feeling up to
> wading through
> another massive thread but I'm quite interested to hear the
> challenges that numpy deals with.
Well, not much new to it, really. But here's a re-cap:
numpy has had an 'S' dtype for a while, which corresponded to the py2
string type (except for being fixed length). So it could auto-convert
to-from python strings... all was good and happy.
Enter py3: what to do? there is no py2 string type anymore. So it was
decided to have the 'S' dtype correspond to the py3 bytes
type. Apparently there was thought of renaming it, but the 'B' and 'b'
type identifiers were already taken, so 'S' was kept.
However, as we all know in this thread, the py3 bytes type is not the same
thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S'
type for text data -- so that is kind of broken in py3.
However, other folks use the 'S' type for binary data, so like (and rely
on) it being mapped to the py3 bytes type. So we are stuck with that.
Given the nature of numpy, and scientific data, there is talk of having a
one-byte-per-char text type in numpy (there is already a unicode type, but
it uses 4-bytes-per-char, as it's key to the numpy data model that all
objects of a given type are the same size.) This would be analogous to the
current multiple precision options for numbers. It would take up less
memory, and would not be able to hold all values. It's not clear what the
level of support is for this right now -- after all, you can do everything
you need to do with the appropriate calls to encode() and decode(), if a
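
The size trade-off between a 4-bytes-per-char and a 1-byte-per-char text type can be illustrated with the standard library alone. In this sketch UTF-32 and latin-1 stand in for the two fixed-width storage schemes; picking latin-1 is an assumption made for the example, since no encoding is named here.

```python
text = 'abc'

four_bytes_per_char = text.encode('utf-32-le')  # fixed 4 bytes per character
one_byte_per_char = text.encode('latin-1')      # fixed 1 byte per character

try:
    '\u03c0'.encode('latin-1')   # GREEK SMALL LETTER PI is outside latin-1
    pi_fits = True
except UnicodeEncodeError:
    pi_fits = False

print(len(four_bytes_per_char), len(one_byte_per_char), pi_fits)
```

The narrower form takes a quarter of the space but cannot represent every character, which mirrors the precision trade-off for numeric types.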
Meanwhile, back at the ranch -- related, but separate issues
have arisen with the functions that parse text files: numpy.loadtxt and
numpy.genfromtxt. These functions were adapted for py3 just enough to get
things to mostly work, but have some serious limitations when doing
anything with unicode -- and in fact do some weird things with plain ascii
text files if you ask it to create unicode objects, and that is a natural
thing to do (and the "right" thing to do in the Py3 text model) if you do
arr = loadtxt('a_file_name', dtype=str)
on py3, an str is a py3 unicode string, so you get the numpy 'U' datatype
but loadtxt wasn't designed to deal with that, so you can get stuff like:
This was (presumably; I haven't debugged the code) due to conversion from
bytes to unicode... (I'm still confused about the extra slashes)
And this is ascii text -- it gets worse if there is non-ascii text in there.
Anyway, the truth is, this stuff is hard, but it will get at least a touch
easier with PEP 461.
[though to be truthful, I'm not sure why someone put a comment in the issue
tracker about b'%d'%some_num being an issue ... I'm not sure how it is, when
we're going from text to numbers, not the other way around...]
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
One of the downsides of converting positional-only functions to Argument
Clinic is that it can result in misleading docstring signatures. Example:
The problem with the new signature is that it indicates passing None for
protocolname is the same as omitting it (the other, much larger problem is
that it falsely indicates keyword compatibility, but that's a separate
Is it OK to change a longstanding function to treat None like an absent
parameter, where previously it was an error? (This also entails a docs
update and maybe a changelog entry)
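
The difference between "None is an error" and "None means absent" can be sketched in plain Python. Both functions and the `_MISSING` sentinel are hypothetical and do not correspond to any real stdlib function; they only contrast the two behaviours being asked about.

```python
_MISSING = object()  # hypothetical sentinel meaning "argument not supplied"


def lookup_strict(name, protocolname=_MISSING):
    """Longstanding behaviour: omitting is fine, passing None is an error."""
    if protocolname is _MISSING:
        return (name, 'any')
    if protocolname is None:
        raise TypeError('protocolname must be str, not None')
    return (name, protocolname)


def lookup_lenient(name, protocolname=None):
    """Changed behaviour: None is treated the same as omitting the argument."""
    return (name, 'any' if protocolname is None else protocolname)


print(lookup_strict('http'))
print(lookup_lenient('http', None))
```

Only the lenient version can be described honestly by a signature that shows `protocolname=None`, which is why the change is tempting despite altering behaviour.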
The current tally of votes, by order of popularity:
Side file: +6
Multiple buffers, Modified buffer, Forward buffer: +1
However, as stated, support for "side files" will not go in unless Guido
explicitly states that it's okay with him. He has not. Therefore it's
not going in. If you want this feature, take it up with our BDFL. I
feel my hands are tied.
Second-best is all the buffer approaches, collectively. Since there was
no clear winner, I'm going to make the new default the "modified buffer"
approach, as that's the only one that does not require rearranging your
code to use. However, to encourage continued experimentation, I'm going
to leave in the configurability (at least for now), so people can keep
experimenting. Maybe we'll find something in the future that's a clear
As a stretch goal, I'd like to also add Zachary Ware's proposed
"forward" buffer, as a further concession to experimentation. It
shouldn't be too messy, but if it gets out of hand I'll back out of it.
Finally, I'm going to add support for "presets" so you can switch
between original / modified buffer / buffer / forward buffer with just
one statement. (Multiple buffers doesn't need a different preset.)
I'll also keep the line prefix (and add a line suffix too) and see if a
prefix of "/*clinic*/" helps.
The whole discussion of whether clinic should write its output
right in the source file (buffered or not), or in a separate sidefile,
started because we currently cannot run the clinic during the build
process, since it’s written in python.
But what if, at some point, someone implements the Tools/clinic.py in
pure C, so that integrating it directly in the build process will be
possible? In this case, the question is — should we use python code
in the argument clinic DSL?
If we keep it strictly declarative, then, at least, we’ll have this
possibility in the future.
On 01/16/2014 04:49 AM, Michael Urman wrote:
> On Thu, Jan 16, 2014 at 1:52 AM, Ethan Furman <ethan(a)stoneleaf.us> wrote:
>>> Is this an intended exception to the overriding principle?
>> Hmm, thanks for spotting that. Yes, that would be a value error if anything
>> over 255 is used, both currently in Py2, and for bytes in Py3. As Carl
>> suggested, a little more explanation is needed in the PEP.
> FYI, note that str/unicode already has another value-dependent
> exception with %c. I find the message surprising, as I wasn't aware
> Python had a 'char' type:
>>>> '%c' % 'a'
>>>> '%c' % 'abc'
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> TypeError: %c requires int or char
Python doesn't have a char type, it has str's of length 1... which are usually referred to as char's. ;)
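
For reference, the value-dependent behaviour of %c on str can be checked with a few lines. The sketch tests only the exception type, since the wording of the message may vary between Python versions.

```python
single = '%c' % 'a'     # a str of length 1 is accepted
from_code = '%c' % 97   # so is an int code point

try:
    '%c' % 'abc'        # a longer str is rejected
    long_str_rejected = False
except TypeError:
    long_str_rejected = True

print(single, from_code, long_str_rejected)
```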