__str__ vs. __repr__
Tim Peters
tim_one at email.msn.com
Fri Nov 5 02:04:30 EST 1999
[amazingly enough, still sticking to what the subject line sez <ahem>!]
Let me back off to what repr and str "should do":
repr(obj) should return a string such that
eval(repr(obj)) == obj
I'd go further and formalize a property that's currently honored, but only
implicitly: the string returned by repr should contain only characters c
such that
c in "\t\n" or 32 <= ord(c) < 127
i.e. it should be restricted to the set of 7-bit ASCII characters that the C
std guarantees can be read back in faithfully in text mode. Together, those
properties spell out what repr() is "good for": a highly portable and
eval'able text representation that captures *everything* important about an
object's value.
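
A minimal sketch of those two properties as a check. It is written in present-day Python 3, which is an assumption (the interpreter under discussion here is older), and the helper names is_portable and repr_round_trips are hypothetical, not part of any library.

```python
def is_portable(text):
    """True if every character is tab, newline, or printable 7-bit ASCII."""
    return all(c in "\t\n" or 32 <= ord(c) < 127 for c in text)


def repr_round_trips(obj):
    """True if repr(obj) is portable text that evaluates back to an equal object."""
    text = repr(obj)
    # eval is tolerable here only because the text came from repr() itself.
    return is_portable(text) and eval(text) == obj


samples = [1, "it's", [1, 2, (3, 4)], {"a": None}]
results = [repr_round_trips(s) for s in samples]
```

Objects whose repr() is not an expression (most user classes, by default) would fail this check, which is rather the point of insisting on the property.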
In contrast, str() should return a string that's pleasant for people to
read. This may involve suppressing obscure or minor details, displaying it
in a form that's inefficient -- or impossible -- to process, or even plain
lying. Examples include the Rat class I used before (the ratio of two huge
integers is exact but unhelpful), a DateTime class that stores times as
seconds from the epoch (ditto), or a Matrix class whose instances may
contain megabytes of information (and so a faithful repr() string may be
intolerable almost all the time).
Viewed that way, I'd collapse all of your (Guido's) characterizations of
what I'm after into one: repr should never be invoked implicitly. If you
want a faithful (but often unhelpful) string, use repr(obj) or `obj`
explicitly. Part of not invoking repr implicitly includes containers not
inventing repr() calls out of thin air <wink>.
David Ascher suggested that str(long) drop the trailing "L", and that's a
good example. The "L" is appropriate for repr(), for so long as Python
maintains such a sharp distinction between ints and longs, but is really of
no help to most users most of the time.
repr() currently cheats in another way: for purely aesthetic reasons,
repr(float) and repr(complex) don't generate enough digits to allow exact
reconstruction of their arguments (assuming high-quality float<->string in
the platform libc). This is a case where an argument appropriate to str()
was applied to repr(), not because it makes *sense* for repr, but because so
much output goes thru an *implicit* repr() now and you didn't want to see
all those "ugly" long float strings <0.50000000000000079 wink>.
repr(float) should do its job correctly here; str(float) should continue to
produce the "nice" (but numerically inadequate) strings. (BTW, on an
IEEE-754 platform repr(float) should generate up to 17 significant digits
(suppression of trailing zeroes is fine) -- the 754 std guarantees that if
conforming output generates that many, a conforming input routine will
exactly reconstruct the original value -- note that this does *not* require
best-possible conversion in either direction -- it turns out that 17 is
enough for somewhat sloppy conversions to succeed, and many platforms are up
to this easier task.)
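
The 17-digit claim can be tried out directly. The sketch below assumes IEEE-754 doubles and present-day Python 3; the helper name faithful is hypothetical, and the 12-digit format merely stands in for a "nice" short form.

```python
def faithful(x):
    """Format a float with up to 17 significant digits (%g drops trailing zeroes)."""
    return "%.17g" % x


values = [0.1, 1.0 / 3.0, 2.0 ** 0.5, 1e-300, 123456789.123456789]

# Each value survives the trip float -> string -> float unchanged.
round_trips = [float(faithful(v)) == v for v in values]

# A shorter, prettier form does not always survive the same trip.
short_round_trip = float("%.12g" % (1.0 / 3.0)) == 1.0 / 3.0
```

Here every entry of round_trips should be true while short_round_trip is false, which is the difference between a string that is pleasant and one that is adequate.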
Onward. Or backward:
[Tim]
>> In a world where containers passed str() down, a container's
>> str() would presumably be responsible for adding disambiguating
>> delimiters to element str() results when needed (the container
>> knows its own output syntax, and can examine the strings produced
>> by its elements ...
[Guido]
> Hm... What kind of things would you expect e.g. the list str() to do
> to its item str()s? Put backslashes before commas?
This depends on how seriously you take all this <wink>.
If you take it very seriously, __str__ should take a second flag argument,
saying whether the consumer of the string will be embedding the string in a
larger context or displaying it directly. Then it's up to __str__ to put
its own brand of delimiters around its output when appropriate.
I suspect that's overkill, though -- that the problem here is very specific
to a string object's __str__ (containers will put matching brackets at each
end of their str() or repr() output regardless, numbers of various sorts
don't need delimiters, and a user class __str__ generally produces a highly
stylized string -- seems that it's *only* string objects that have wildly
unpredictable display forms).
If true, it may work fine just to be simple and pragmatic: let containers
special-case the snot out of elements of string type, sticking a pair of
quotes around what string's __str__ returns and backslash-escaping any
embedded quotes of the same type.
Then
>>> names = ["François", "Tim", "Gu'ido"]
>>> names # same as str(names) in this make-believe world
['François', 'Tim', 'Gu\'ido']
or
['François', 'Tim', "Gu'ido"]
is what I'd expect.
I certainly don't want to get rid of the commas, colons, parens, braces and
brackets for list, tuple and dict str() output: they're very helpful! I
just want to be able to read what's *between* all that stuff.
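
A rough sketch of that pragmatic rule, in present-day Python 3 (an assumption). The names show and show_element are hypothetical, and only lists are handled; tuples and dicts would follow the same pattern.

```python
def show(obj):
    """Readable text for obj; a list passes the work down to its elements."""
    if isinstance(obj, list):
        return "[" + ", ".join(show_element(item) for item in obj) + "]"
    return str(obj)


def show_element(obj):
    """Like show(), but wraps strings in quotes so they stay distinguishable."""
    if isinstance(obj, str):
        # Prefer single quotes; use double quotes when that avoids escaping.
        quote = '"' if "'" in obj and '"' not in obj else "'"
        return quote + obj.replace(quote, "\\" + quote) + quote
    return show(obj)


names = ["François", "Tim", "Gu'ido"]
text = show(names)
```

With these choices text comes out as ['François', 'Tim', "Gu'ido"]. Nothing but the surrounding quote character is escaped, so the result is meant for reading, not for eval.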
> These are all good points.
Ya, well ... everything sounds kinda good before it's implemented <0.6
wink>.
> In a typical scenario which I wanted to avoid, a user has a variable
> containing the string '1' but mistakenly believes that it contains the
> integer 1. (This happens a lot, e.g. it could be read from a file
> containing numbers.) The user tries various numeric operations on the
> variable and they all raise exceptions. The user is inexperienced and
> doesn't understand what the exceptions are, but gets the idea to
> display its value to see if something's wrong with it. One of the
> first things users learn is to use interactive Python as a power
> calculator, so my hypothetical user just types the name of the
> variable. If this would use str() to format the value, the user is no
> wiser, and perhaps more confused, since str('1') is the same as
> str(1).
Understood and appreciated. Using the pragmatic approach above, I would
have PRINT_EXPR call the same routine that container str() implementations
call, i.e. one that special-cases the snot out of strings. Then
>>> x, y = 1, '1'
>>> x
1
>>> y
'1'
>>> print x
1
>>> print y
1
>>>
is what I'd expect.
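
One way to picture that split, sketched in present-day Python 3 (an assumption). sys.displayhook is the real hook the interactive prompt calls there; quote_if_string and display_hook are hypothetical names, and the real hook also records the value in builtins._, which is skipped here.

```python
import sys


def quote_if_string(obj):
    """str() of obj, except that strings get quotes put around them."""
    if isinstance(obj, str):
        return "'" + obj.replace("'", "\\'") + "'"
    return str(obj)


def display_hook(value):
    """Stand-in for what the prompt does with the value of an expression."""
    if value is not None:
        print(quote_if_string(value))


# To try it interactively (left commented out so importing has no side effects):
# sys.displayhook = display_hook

x, y = 1, "1"
at_prompt = [quote_if_string(x), quote_if_string(y)]
printed = [str(x), str(y)]
```

at_prompt keeps the integer and the string apart, while printed does not, matching the session above.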
> ...
> We can then argue over what str() of a list L should return; one
> extreme possibility would be to return string.join(map(str, L));
I definitely don't want to lose the brackets or the commas.
> a slightly less radical solution would be
> '[' + string.join(map(str, L), ', ') + ']'
Yes, but replacing str with str_special_casing_the_snot_out_of_strings.
> In the first case, your last example would go like this:
>
> >>> names
> François Tim
> >>>
Yuck!
> while the [second] choice would give
>
> >>> names
> [François, Tim]
> >>>
As above, ['François', 'Tim'] is my ideal.
> There may be other solutions -- e.g. in Tcl, a list is displayed as
> the items separated by spaces, with the proviso that items containing
> spaces are displayed inside (matching) curly braces; unmatched braces
> are displayed using backslashes, guaranteeing that the output can be
> parsed back into a list with the same value as the original. (Hey!
> That's the same as Python's rule! So why does it work in Tcl?
> Because variables only contain strings, and the equivalent of the Rat
> class used above can't be coded in Tcl.)
I'm explicitly giving up the property that str(obj) always be eval'able in a
sensible way. That's repr()'s job. For example, I expect
>>> x = "\\t"
>>> x
'\t'
>>>
and evaluating that displayed text gives a tab, which certainly isn't x. This is a tradeoff: we stop pretending
that str() is adequate for equality under evalability <wink>, but take EUE
seriously for repr(). In return, str() gets to do what it was meant to do:
produce "nice" strings, without compromise for the sake of faux
(inconsistent, unpredictable) EUE.
> The problem with displaying 'François' has been mentioned to me
> before. (By the way, I have no idea how to *type* that. I'm just
> cutting and pasting it from Tim's message.)
European keyboard, programmable keyboard, C-q something-or-other in Emacs,
Alt+0231 in Windows (keyboard generation of any 8-bit code is built in to
Windows keyboard drivers), or even copy+paste from Accessories->Character
Map under Windows. My employer spends a lot of time worrying about other
countries' foolish character sets <wink>.
> There's another scenario I was trying to avoid. This is probably
> something that happened once too many times when I was young and
> innocent, so I may be overreacting. Consider the following:
>
> >>> f = open("somefile")
> >>> a = f.readline()
> >>> print a
> %âãÏÓ1.3
>
> >>>
>
> Now this example is relatively harmless. Using repr(), I see that the
> string contains a \r character that caused the cursor to back up to
> the start of the line, overwriting what was already written:
>
> >>> a
> '%PDF-1.3\015%\342\343\317\323\015\012'
> >>>
>
> But the thing that some big bad program did to me long ago was more
> like spit out several thousand garbage bytes which contained enough
> escape sequences to lock up my terminal requiring me to powercycle and
> log in again. (The fact that the story refers to a terminal indicates
> how long ago this was. :-)
>
> So I vowed that *my* language would not (easily) let this happen by
> accident, and the way I enforced that was by making sure that all
> non-ASCII characters would be printed as octal escapes, unless you use
> the print statement.
An irony is that I've locked up a DOS box doing this on many occasions -- in
a script, "print a" is not exactly a *hard* mistake to fall into <wink>.
I'm not unsympathetic, but this is something you really can't stop without
rendering Python useless to grownups. How does Python's Unicode story
relate to this? Displaying a Unicode string as a pile of \uxxxx (whatever)
escapes defeats the whole purpose of Unicode. Ditto displaying it in UTF-7.
OTOH, for some number of years to come it will be a rare display device that
*won't* treat Unicode 16-bit values (whether or not UTF-8 encoded) as a pile
of 8-bit control codes.
> It's a separate dilemma from the other examples. My problem here is
> that I hate to make assumptions about the character set in use. How
> do I know that '\237' is unprintable but '\241' is a printable
> character?
We have no idea.
> How do I know that the latter is an upside-down exclamation point?
Ditto.
> Should I really assume stdout is capable of displaying Latin-1?
No. But I don't grasp why you think you need to know *anything* about this.
Until Unicode takes over the world, there's nothing you can do other than
tell users the truth: "most of" printable 7-bit ASCII displays the same way
across the world, but outside of that *all* bets are off. It will vary by
all of OS, display device, displaying program, and user configuration
choices.
> Strictly, the str() function doesn't even know that its output is
> going to stdout.
That's right, and C doesn't even guarantee that I can read François back in
from a text file! It's that bad. *So* bad that everyone lives with it and
never notices it <0.9 wink>.
> I suppose I could use isprint(), and then François could use the locale
> module in his $PYTHONSTARTUP file to make it do the right thing. Is that
> good enough?
I don't know what you're trying to *accomplish* here. isprint() is a piece
of crap, not least because what can or can't be displayed has much more to
do with the font you're currently using than with anything C knows about.
That is, it's outside C's areas of competence.
For example, here's the relatively new euro symbol: €. That's not in
Latin-1. Windows added it as an extension to Latin-1, at hex code 0x80 (one
of the 32 codes Latin-1 didn't assign glyphs to). In my mailer at the
moment, even on Windows, it shows up as a square box. That's because my
font at the moment happens to be Courier. If I switch my font to Courier
New, it shows up as intended -- but even then only because I happened to
install the service pack that added the euro symbol, and Courier New (unlike
Courier) is a "code page 1252" font. On any non-Windows system, God only
knows what you'll see. The point is that C's isprint(0x80) can't give a
reasonable answer even sticking solely to my machine! What I can or can't
display has, alas, nothing to do with C.
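
The ambiguity of that one byte can be shown with the codecs shipped in present-day Python 3 (an assumption; "latin-1" and "cp1252" are standard codec names there). The variable names are illustrative only.

```python
byte = bytes([0x80])

as_latin1 = byte.decode("latin-1")  # U+0080: a control code with no glyph
as_cp1252 = byte.decode("cp1252")   # U+20AC: the euro sign

# str.isprintable() answers from the Unicode character database, so the
# answer flips with the assumed encoding.
latin1_printable = as_latin1.isprintable()
cp1252_printable = as_cp1252.isprintable()
```

Neither answer says anything about whether the font in use actually has a glyph for the character, which is the part no library routine can know.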
> (I just tried this briefly. It seems that somehow the locale module
> doesn't affect this?!?!
Locale is supposed to affect all of C's isxxx functions, but I wouldn't bet
that most implementations do it correctly. As above, even if they did,
what's actually displayable is a different issue.
> I still think that ['a', 'b'] should be displayed like that, and not
> like [a, b].
Me too! I don't want to make the currently-readable unreadable -- I want to
keep that and get the converse too.
> I'm not sure what to do about a dict of Rat() items,
str(recip) (my example) produces {0.100: 1.000, 1.000: 0.100} (which follows
automatically from str(dict) "passing str() down", and from the fact that
neither the keys nor the values are themselves of string type); repr(recip)
produces the unreadable but precise and eval'able string that's produced
today (which follows automatically from repr(dict) passing repr() down).
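
A sketch of the two behaviours side by side, in present-day Python 3 (an assumption). The Rat class below is a hypothetical stand-in that assumes positive denominators, not the class from the earlier message, and str_dict is a hypothetical helper playing the role of a dict str() that passes str() down.

```python
from math import gcd


class Rat:
    """Hypothetical exact rational, stored in lowest terms."""

    def __init__(self, num, den):
        g = gcd(num, den)
        self.num, self.den = num // g, den // g

    def __repr__(self):
        # Faithful and eval'able.
        return "Rat(%d, %d)" % (self.num, self.den)

    def __str__(self):
        # Pleasant, and lying a little.
        return "%.3f" % (self.num / self.den)

    def __eq__(self, other):
        return isinstance(other, Rat) and (self.num, self.den) == (other.num, other.den)

    def __hash__(self):
        return hash((self.num, self.den))


def str_dict(d):
    """Make-believe str(dict): str() is passed down to keys and values."""
    return "{" + ", ".join("%s: %s" % (str(k), str(v)) for k, v in d.items()) + "}"


recip = {Rat(1, 10): Rat(10, 1), Rat(3, 1): Rat(1, 3)}
readable = str_dict(recip)  # str() passed down
faithful = repr(recip)      # repr() passed down, as dicts do today
```

readable is for people and faithful is for eval; string keys or values would additionally need the quoting treatment described earlier.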
> except perhaps giving up on the fiction that __repr__() should
> return something expression-like for user-defined classes...
>
> Any suggestions?
Yes: stop being so radical <wink>. That eval(repr(obj)) == obj is a
*wonderful* property of the language design! I don't want to lose that
either. My claim is that repr() is rarely appropriate, but is essential
when it is. This is very often very apparent in user-defined classes (at
least mine <wink -- but I seem to hit this every time!>). It was hard to
see at first because the builtin types other than string treated repr() and
str() as synonyms, and nobody bitched enough about str(long)'s trailing "L",
and I didn't bitch enough about repr(float)'s == str(float)'s systematic
inaccuracy, and Python's European groupies didn't bitch enough about
repr(euro_string) being an unreadable backslashed mess.
IOW, I don't want to get rid of the str/repr design! I want to take it
seriously and do all the obvious <wink> things that follow from that.
telling-someone-what-they-really-think-takes-many-words-ly y'rs - tim