str vs. repr

Fri Nov 5 02:04:30 EST 1999

[amazingly enough, still sticking to what the subject line sez <ahem>!]

Let me back off to what repr and str "should do":

repr(obj) should return a string such that
    eval(repr(obj)) == obj
I'd go further and formalize a property that's currently honored, but only
implicitly:  the string returned by repr should contain only characters c
such that
    c in "\t\n" or 32 <= ord(c) < 127
i.e. it should be restricted to the set of 7-bit ASCII characters that the C
std guarantees can be read back in faithfully in text mode.  Together, those
properties spell out what repr() is "good for":  a highly portable and
eval'able text representation that captures *everything* important about an
object's value.
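
A minimal sketch of checking both properties, assuming a current Python 3 interpreter rather than the Python of this post. The helper names is_portable and round_trips are made up for illustration. One caveat: Python 3's repr() leaves printable non-ASCII characters alone, so the built-in ascii() is what yields the 7-bit form there.

```python
def is_portable(s):
    """True if every character is tab, newline, or printable 7-bit ASCII."""
    return all(c in "\t\n" or 32 <= ord(c) < 127 for c in s)

def round_trips(obj):
    """True if evaluating repr(obj) gives back an equal object."""
    return eval(repr(obj)) == obj

# A few builtin values whose repr() is both eval'able and portable.
samples = [1, 2.5, "Gu'ido", [1, "a"], {"k": (1, 2)}]
checks = [(round_trips(o), is_portable(repr(o))) for o in samples]

name = "François"
plain = repr(name)      # Python 3 keeps the c-cedilla as-is
escaped = ascii(name)   # backslash-escapes everything outside 7-bit ASCII
```

Here ascii(name) still evals back to the original string, which is the combination of properties described above.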

In contrast, str() should return a string that's pleasant for people to
read.  This may involve suppressing obscure or minor details, displaying it
in a form that's inefficient-- or impossible --to process, or even plain
lying.  Examples include the Rat class I used before (the ratio of two huge
integers is exact but unhelpful), a DateTime class that stores times as
seconds from the epoch (ditto), or a Matrix class whose instances may
contain megabytes of information (and so a faithful repr() string may be
intolerable almost all the time).
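
To make the contrast concrete, here is a hedged sketch of a rational class in Python 3. It is a hypothetical stand-in, not the Rat class from the earlier message: __repr__ is exact and eval'able, while __str__ rounds to three decimals and so "lies" a little for the reader's benefit.

```python
class Rat:
    """Hypothetical exact rational, for illustration only."""

    def __init__(self, num, den):
        self.num, self.den = num, den

    def __eq__(self, other):
        # Compare by cross-multiplication so no precision is lost.
        return (isinstance(other, Rat)
                and self.num * other.den == other.num * self.den)

    def __repr__(self):
        # Faithful: captures everything, however long the integers are.
        return "Rat(%r, %r)" % (self.num, self.den)

    def __str__(self):
        # Pleasant: a rounded decimal that drops most of the information.
        return "%.3f" % (self.num / self.den)

r = Rat(10 ** 30 + 1, 3 * 10 ** 30)
faithful = repr(r)   # two 31-digit integers
pleasant = str(r)    # a short decimal approximation
```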

Viewed that way, I'd collapse all of your (Guido's) characterizations of
what I'm after into one:  repr should never be invoked implicitly.  If you
want a faithful (but often unhelpful) string, use repr(obj) or `obj`
explicitly.  Part of not invoking repr implicitly includes containers not
inventing repr() calls out of thin air <wink>.

David Ascher suggested that str(long) drop the trailing "L", and that's a
good example.  The "L" is appropriate for repr(), for so long as Python
maintains such a sharp distinction between ints and longs, but is really of
no help to most users most of the time.

repr() currently cheats in another way:  for purely aesthetic reasons,
repr(float) and repr(complex) don't generate enough digits to allow exact
reconstruction of their arguments (assuming high-quality float<->string in
the platform libc).  This is a case where an argument appropriate to str()
was applied to repr(), not because it makes *sense* for repr, but because so
much output goes thru an *implicit* repr() now and you didn't want to see
all those "ugly" long float strings <0.50000000000000079 wink>.

repr(float) should do its job correctly here; str(float) should continue to
produce the "nice" (but numerically inadequate) strings.  (BTW, on an
IEEE-754 platform repr(float) should generate up to 17 significant digits
(suppression of trailing zeroes is fine) -- the 754 std guarantees that if
conforming output generates that many, a conforming input routine will
exactly reconstruct the original value -- note that this does *not* require
best-possible conversion in either direction -- it turns out that 17 is
enough for somewhat sloppy conversions to succeed, and many platforms are up
to this easier task.)
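
A small sketch of the 17-digit claim, assuming Python 3 on a platform with IEEE-754 doubles. The 12-digit format is only a stand-in for a "nice" short display; the exact digit count is an assumption here, not something stated above.

```python
x = 0.1 + 0.2

short = "%.12g" % x   # "nice" form: trailing noise digits are rounded away
full = "%.17g" % x    # 17 significant digits

# Reading the strings back shows which one preserves the value.
short_ok = float(short) == x
full_ok = float(full) == x

# The same round trip over a few unrelated values.
others = [1 / 3, 2 ** 0.5, 1e-300, 123456789.123456789]
all_ok = all(float("%.17g" % v) == v for v in others)
```

The short string reads back as a different float, while the 17-digit string reconstructs the original exactly.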

Onward.  Or backward:

[Tim]
>> In a world where containers passed str() down, a container's
>> str() would presumably be responsible for adding disambiguating
>> delimiters to element str() results when needed (the container
>> knows its own output syntax, and can examine the strings produced
>> by its elements ...

[Guido]
> Hm...  What kind of things would you expect e.g. the list str() to do
> to its item str()s?  Put backslashes before commas?

This depends on how seriously you take all this <wink>.

If you take it very seriously, __str__ should take a second flag argument,
saying whether the consumer of the string will be embedding the string in a
larger context or displaying it directly.  Then it's up to __str__ to put
its own brand of delimiters around its output when appropriate.

I suspect that's overkill, though -- that the problem here is very specific
to a string object's __str__ (containers will put matching brackets at each
end of their str() or repr() output regardless, numbers of various sorts
don't need delimiters, and a user class __str__ generally produces a highly
stylized string -- seems that it's *only* string objects that have wildly
unpredictable display forms).

If true, it may work fine just to be simple and pragmatic:  let containers
special-case the snot out of elements of string type, sticking a pair of
quotes around what string's __str__ returns and backslash-escaping any
embedded quotes of the same type.

Then

>>> names = ["François", "Tim", "Gu'ido"]
>>> names   # same as str(names) in this make-believe world
['François', 'Tim', 'Gu\'ido']
    or
['François', 'Tim', "Gu'ido"]

is what I'd expect.
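
The pragmatic rule can be sketched in a few lines of Python 3. The names quote_if_string and list_str are hypothetical; this illustrates the make-believe behavior, not anything Python actually does. Only embedded quotes of the same type are escaped; everything else in the string is passed through untouched.

```python
def quote_if_string(obj):
    """str() for most objects; strings get quotes and escaped inner quotes."""
    if isinstance(obj, str):
        return "'" + obj.replace("'", "\\'") + "'"
    return str(obj)

def list_str(items):
    """A list display that passes the string-aware str() down to its items."""
    return "[" + ", ".join(quote_if_string(x) for x in items) + "]"

names = ["François", "Tim", "Gu'ido"]
shown = list_str(names)        # first of the two forms shown above
mixed = list_str([1, "1", 2.5])
```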

I certainly don't want to get rid of the commas, colons, parens, braces and
brackets for list, tuple and dict str() output:  they're very helpful!  I
just want to be able to read what's *between* all that stuff.

> These are all good points.

Ya, well ... everything sounds kinda good before it's implemented <0.6
wink>.

> In a typical scenario which I wanted to avoid, a user has a variable
> containing the string '1' but mistakenly believes that it contains the
> integer 1.  (This happens a lot, e.g. it could be read from a file
> containing numbers.)  The user tries various numeric operations on the
> variable and they all raise exceptions.  The user is inexperienced and
> doesn't understand what the exceptions are, but gets the idea to
> display its value to see if something's wrong with it.  One of the
> first things users learn is to use interactive Python as a power
> calculator, so my hypothetical user just types the name of the
> variable.  If this would use str() to format the value, the user is no
> wiser, and perhaps more confused, since str('1') is the same as
> str(1).

Understood and appreciated.  Using the pragmatic approach above, I would
have PRINT_EXPR call the same routine that container str() implementations
call, i.e. one that special-cases the snot out of strings.  Then

>>> x, y = 1, '1'
>>> x
1
>>> y
'1'
>>> print x
1
>>> print y
1
>>>

is what I'd expect.
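
As a rough illustration in Python 3: the interactive echo in later Pythons goes through sys.displayhook, which did not exist when this was written, so the hook below is an assumption layered on top of the proposal. The helper quote_if_string and the hook nice_displayhook are made-up names, and the sketch calls the hook directly instead of installing it.

```python
import contextlib
import io

def quote_if_string(obj):
    """str() for most objects; strings get quotes and escaped inner quotes."""
    if isinstance(obj, str):
        return "'" + obj.replace("'", "\\'") + "'"
    return str(obj)

def nice_displayhook(value):
    """What the interactive echo would do under the proposal."""
    if value is not None:
        print(quote_if_string(value))

x, y = 1, '1'
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    nice_displayhook(x)   # echo of an int
    nice_displayhook(y)   # echo of a string: quoted, so the type is visible
    print(x)              # explicit print uses plain str()
    print(y)
lines = buf.getvalue().splitlines()
```

Assigning nice_displayhook to sys.displayhook in an interactive session would apply it to every echoed expression.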

> ...
> We can then argue over what str() of a list L should return; one
> extreme possibility would be to return string.join(map(str, L));

I definitely don't want to lose the brackets or the commas.

> a slightly less radical solution would be
>     '[' + string.join(map(str, L), ', ') + ']'

Yes, but replacing str with str_special_casing_the_snot_out_of_strings.

> In the first case, your last example would go like this:
>
> >>> names
> François Tim
> >>>

Yuck!

> while the [second] choice would give
>
> >>> names
> [François, Tim]
> >>>

As above, ['François', 'Tim'] is my ideal.

> There may be other solutions -- e.g. in Tcl, a list is displayed as
> the items separated by spaces, with the proviso that items containing
> spaces are displayed inside (matching) curly braces; unmatched braces
> are displayed using backslashes, guaranteeing that the output can be
> parsed back into a list with the same value as the original.  (Hey!
> That's the same as Python's rule!  So why does it work in Tcl?
> Because variables only contain strings, and the equivalent of the Rat
> class used above can't be coded in Tcl.)

I'm explicitly giving up the property that str(obj) always be eval'able in a
sensible way.  That's repr()'s job.  For example, I expect

>>> x = "\\t"
>>> x
'\t'
>>>

and evaluating the displayed text '\t' yields a tab character, which
certainly isn't x.  This is a tradeoff:  we stop pretending
that str() is adequate for equality under evalability <wink>, but take EUE
seriously for repr().  In return, str() gets to do what it was meant to do:
produce "nice" strings, without compromise for the sake of faux
(inconsistent, unpredictable) EUE.

> The problem with displaying 'François' has been mentioned to me
> before.  (By the way, I have no idea how to *type* that.  I'm just
> cutting and pasting it from Tim's message.)

European keyboard, programmable keyboard, C-q something-or-other in Emacs,
Alt+0231 in Windows (keyboard generation of any 8-bit code is built in to
Windows keyboard drivers), or even copy+paste from Accessories->Character
Map under Windows.  My employer spends a lot of time worrying about other
countries' foolish character sets <wink>.

> There's another scenario I was trying to avoid.  This is probably
> something that happened once too many times when I was young and
> innocent, so I may be overreacting.  Consider the following:
>
> >>> f = open("somefile")
> >>> a = f.readline()
> >>> print a
> %âãÏÓ1.3
>
> >>>
>
> Now this example is relatively harmless.  Using repr(), I see that the
> string contains a \r character that caused the cursor to back up to
> the start of the line, overwriting what was already written:
>
> >>> a
> '%PDF-1.3\015%\342\343\317\323\015\012'
> >>>
>
> But the thing that some big bad program did to me long ago was more
> like spit out several thousand garbage bytes which contained enough
> escape sequences to lock up my terminal requiring me to powercycle and
> log in again.  (The fact that the story refers to a terminal indicates
> how long ago this was. :-)
>
> So I vowed that *my* language would not (easily) let this happen by
> accident, and the way I enforced that was by making sure that all
> non-ASCII characters would be printed as octal escapes, unless you use
> the print statement.

An irony is that I've locked up a DOS box doing this on many occasions -- in
a script, "print a" is not exactly a *hard* mistake to fall into <wink>.

I'm not unsympathetic, but this is something you really can't stop without
rendering Python useless to grownups.  How does Python's Unicode story
relate to this?  Displaying a Unicode string as a pile of \uxxxx (whatever)
escapes defeats the whole purpose of Unicode.  Ditto displaying it in UTF-7.
OTOH, for some number of years to come it will be a rare display device that
*won't* treat Unicode 16-bit values (whether or not UTF-8 encoded) as a pile
of 8-bit control codes.

> It's a separate dilemma from the other examples.  My problem here is
> that I hate to make assumptions about the character set in use.  How
> do I know that '\237' is unprintable but '\241' is a printable
> character?

We have no idea.

> How do I know that the latter is an upside-down exclamation point?

Ditto.

> Should I really assume stdout is capable of displaying Latin-1?

No.  But I don't grasp why you think you need to know *anything* about this.
Until Unicode takes over the world, there's nothing you can do other than
tell users the truth:  "most of" printable 7-bit ASCII displays the same way
across the world, but outside of that *all* bets are off.  It will vary by
all of OS, display device, displaying program, and user configuration
choices.

> Strictly, the str() function doesn't even know that its output is
> going to stdout.

That's right, and C doesn't even guarantee that I can read François back in
from a text file!  It's that bad.  *So* bad that everyone lives with it and
never notices it <0.9 wink>.

> I suppose I could use isprint(), and then François could use the locale
> module in his $PYTHONSTARTUP file to make it do the right thing.  Is that
> good enough?

I don't know what you're trying to *accomplish* here.  isprint() is a piece
of crap, not least because what can or can't be displayed has much more to
do with the font you're currently using than with anything C knows about.
That is, it's outside C's areas of competence.

For example, here's the relatively new euro symbol:  €.  That's not in
Latin-1.  Windows added it as an extension to Latin-1, at hex code 0x80 (one
of the 32 codes Latin-1 didn't assign glyphs to).  In my mailer at the
moment, even on Windows, it shows up as a square box.  That's because my
font at the moment happens to be Courier.  If I switch my font to Courier
New, it shows up as intended -- but even then only because I happened to
install the service pack that added the euro symbol, and Courier New (unlike
Courier) is a "code page 1252" font.  On any non-Windows system, God only
knows what you'll see.  The point is that C's isprint(0x80) can't give a
reasonable answer even sticking solely to my machine!  What I can or can't
display has, alas, nothing to do with C.

> (I just tried this briefly.  It seems that somehow the locale module
> doesn't affect this?!?!

Locale is supposed to affect all of C's isxxx functions, but I wouldn't bet
that most implementations do it correctly.  As above, even if they did,
what's actually displayable is a different issue.

> I still think that ['a', 'b'] should be displayed like that, and not
> like [a, b].

Me too!  I don't want to make the currently-readable unreadable -- I want to
keep that and get the converse too.

> I'm not sure what to do about a dict of Rat() items,

str(recip) (my example) produces {0.100: 1.000, 1.000: 0.100} (which follows
automatically from str(dict) "passing str() down", given that neither the keys
nor the values are themselves of string type); repr(recip) produces the
unreadable but precise and eval'able string that's produced today (which
follows automatically from repr(dict) passing repr() down).

> except perhaps giving up on the fiction that __repr__() should
> return something expression-like for user-defined classes...
>
> Any suggestions?

Yes:  stop being so radical <wink>.  That eval(repr(obj)) == obj is a
*wonderful* property of the language design!  I don't want to lose that
either.  My claim is that repr() is rarely appropriate, but is essential
when it is.  This is very often very apparent in user-defined classes (at
least mine <wink -- but I seem to hit this every time!>).  It was hard to
see at first because the builtin types other than string treated repr() and
str() as synonyms, and nobody bitched enough about str(long)'s trailing "L",
and I didn't bitch enough about repr(float)'s == str(float)'s systematic
inaccuracy, and Python's European groupies didn't bitch enough about
repr(euro_string) being an unreadable backslashed mess.

IOW, I don't want to get rid of the str/repr design!  I want to take it
seriously and do all the obvious <wink> things that follow from that.

telling-someone-what-they-really-think-takes-many-words-ly y'rs  - tim
