[Python-3000] updated PEP3126: Remove Implicit String Concatenation

Mon May 7 17:47:00 CEST 2007

Rewritten -- please tell me if there are any concerns I have missed.

And of course, please tell me if you have a suggestion for the open
issue -- how to better support external internationalization tools, or
at least xgettext in particular.

-jJ

-----------------------------------

PEP: 3126
Title: Remove Implicit String Concatenation
Version: $Revision$
Last-Modified: $Date$
Author: Jim J. Jewett <JimJJewett at gmail.com>,
        Raymond D. Hettinger <python at rcn.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 29-Apr-2007
Post-History: 29-Apr-2007, 30-Apr-2007, 07-May-2007

Abstract
========

Python inherited many of its parsing rules from C.  While this has
been generally useful, there are some individual rules which are less
useful for python, and should be eliminated.

This PEP proposes to eliminate implicit string concatenation based
only on the adjacency of literals.

Instead of::

    "abc" "def" == "abcdef"

authors will need to be explicit, and either add the strings::

    "abc" + "def" == "abcdef"

or join them::

    "".join(["abc", "def"]) == "abcdef"

Motivation
==========

One goal for Python 3000 should be to simplify the language by
removing unnecessary features.  Implicit string concatenation should
be dropped in favor of existing techniques. This will simplify the
grammar and simplify a user's mental picture of Python.  The latter is
important for letting the language "fit in your head".  A large group
of current users do not even know about implicit concatenation.  Of
those who do know about it, a large portion never use it or habitually
avoid it. Of those who both know about it and use it, very few could
state with confidence the implicit operator precedence and under what
circumstances it is computed when the definition is compiled versus
when it is run.

History or Future
-----------------

Many Python parsing rules are intentionally compatible with C.  This
is a useful default, but Special Cases need to be justified based on
their utility in Python.  We should no longer assume that python
programmers will also be familiar with C, so compatibility between
languages should be treated as a tie-breaker, rather than a
justification.

In C, implicit concatenation is the only way to join strings without
using a (run-time) function call to store into a variable.  In Python,
the strings can be joined (and still recognized as immutable) using
more standard Python idioms, such ``+`` or ``"".join``.

Problem
-------

Implicit String concatentation leads to tuples and lists which are
shorter than they appear; this is turn can lead to confusing, or even
silent, errors.  For example, given a function which accepts several
parameters, but offers a default value for some of them::

    def f(fmt, *args):
        print fmt % args

This looks like a valid call, but isn't::

    >>> f("User %s got a message %s",
          "Bob"
          "Time for dinner")

    Traceback (most recent call last):
      File "<pyshell#8>", line 2, in <module>
        "Bob"
      File "<pyshell#3>", line 2, in f
        print fmt % args
    TypeError: not enough arguments for format string

Calls to this function can silently do the wrong thing::

    def g(arg1, arg2=None):
        ...

    # silently transformed into the possibly very different
    # g("arg1 on this linearg2 on this line", None)
    g("arg1 on this line"
      "arg2 on this line")

To quote Jason Orendorff [#Orendorff]

    Oh.  I just realized this happens a lot out here.  Where I work,
    we use scons, and each SConscript has a long list of filenames::

        sourceFiles = [
            'foo.c'
            'bar.c',
            #...many lines omitted...
            'q1000x.c']

    It's a common mistake to leave off a comma, and then scons
    complains that it can't find 'foo.cbar.c'.  This is pretty
    bewildering behavior even if you *are* a Python programmer,
    and not everyone here is.

Solution
========

In Python, strings are objects and they support the __add__ operator,
so it is possible to write::

    "abc" + "def"

Because these are literals, this addition can still be optimized away
by the compiler; the CPython compiler already does so.
[#rcn-constantfold]_

Other existing alternatives include multiline (triple-quoted) strings,
and the join method::

    """This string
       extends across
       multiple lines, but you may want to use something like
       Textwrap.dedent
       to clear out the leading spaces
       and/or reformat.
    """

    >>> "".join(["empty", "string", "joiner"]) == "emptystringjoiner"
    True

    >>> " ".join(["space", "string", "joiner"]) == "space string joiner"

    >>> "\n".join(["multiple", "lines"]) == "multiple\nlines" == (
    """multiple
    lines""")
    True

Concerns
========

Operator Precedence
-------------------

Guido indicated [#rcn-constantfold]_ that this change should be
handled by PEP, because there were a few edge cases with other string
operators, such as the %.  (Assuming that str % stays -- it may be
eliminated in favor of PEP 3101 -- Advanced String Formatting.
[#PEP3101]_ [#elimpercent]_)

The resolution is to use parentheses to enforce precedence -- the same
solution that can be used today::

    # Clearest, works today, continues to work, optimization is
    # already possible.
    ("abc %s def" + "ghi") % var

    # Already works today; precedence makes the optimization more
    # difficult to recognize, but does not change the semantics.
    "abc" + "def %s ghi" % var

as opposed to::

    # Already fails because modulus (%) is higher precedence than
    # addition (+)
    ("abc %s def" + "ghi" % var)

    # Works today only because adjacency is higher precedence than
    # modulus.  This will no longer be available.
    "abc %s" "def" % var

    # So the 2-to-3 translator can automically replace it with the
    # (already valid):
    ("abc %s" + "def") % var

Long Commands
-------------

    ... build up (what I consider to be) readable SQL queries [#skipSQL]_::

        rows = self.executesql("select cities.city, state, country"
                               "    from cities, venues, events, addresses"
                               "    where cities.city like %s"
                               "      and events.active = 1"
                               "      and venues.address = addresses.id"
                               "      and addresses.city = cities.id"
                               "      and events.venue = venues.id",
                               (city,))

Alternatives again include triple-quoted strings, ``+``, and ``.join``::

    query="""select cities.city, state, country
                 from cities, venues, events, addresses
                 where cities.city like %s
                   and events.active = 1"
                   and venues.address = addresses.id
                   and addresses.city = cities.id
                   and events.venue = venues.id"""

    query=( "select cities.city, state, country"
          + "    from cities, venues, events, addresses"
          + "    where cities.city like %s"
          + "      and events.active = 1"
          + "      and venues.address = addresses.id"
          + "      and addresses.city = cities.id"
          + "      and events.venue = venues.id"
          )

    query="\n".join(["select cities.city, state, country",
                     "    from cities, venues, events, addresses",
                     "    where cities.city like %s",
                     "      and events.active = 1",
                     "      and venues.address = addresses.id",
                     "      and addresses.city = cities.id",
                     "      and events.venue = venues.id"])

    # And yes, you *could* inline any of the above querystrings
    # the same way the original was inlined.
    rows = self.executesql(query, (city,))

Regular Expressions
-------------------

Complex regular expressions are sometimes stated in terms of several
implicitly concatenated strings with each regex component on a
different line and followed by a comment.  The plus operator can be
inserted here but it does make the regex harder to read.  One
alternative is to use the re.VERBOSE option.  Another alternative is
to build-up the regex with a series of += lines::

    # Existing idiom which relies on implicit concatenation
    r = ('a{20}'  # Twenty A's
         'b{5}'   # Followed by Five B's
         )

    # Mechanical replacement
    r = ('a{20}'  +# Twenty A's
         'b{5}'   # Followed by Five B's
         )

    # already works today
    r = '''a{20}  # Twenty A's
           b{5}   # Followed by Five B's
        '''                 # Compiled with the re.VERBOSE flag

    # already works today
    r = 'a{20}'   # Twenty A's
    r += 'b{5}'   # Followed by Five B's

Internationalization
--------------------

Some internationalization tools -- notably xgettext -- have already
been special-cased for implicit concatenation, but not for Python's
explicit concatenation. [#barryi8]_

These tools will fail to extract the (already legal)::

    _("some string" +
      " and more of it")

but often have a special case for::

    _("some string"
      " and more of it")

It should also be possible to just use an overly long line (xgettext
limits messages to 2048 characters [#xgettext2048]_, which is less
than Python's enforced limit) or triple-quoted strings, but these
solutions sacrifice some readability in the code::

    # Lines over a certain length are unpleasant.
    _("some string and more of it")

    # Changing whitespace is not ideal.
    _("""Some string
         and more of it""")
    _("""Some string
    and more of it""")
    _("Some string \
    and more of it")

I do not see a good short-term resolution for this.

Transition
==========

The proposed new constructs are already legal in current Python, and
can be used immediately.

The 2 to 3 translator can be made to mechanically change::

    "str1" "str2"
    ("line1"  #comment
     "line2")

into::

    ("str1" + "str2")
    ("line1"   +#comments
     "line2")

If users want to use one of the other idioms, they can; as these
idioms are all already legal in python 2, the edits can be made
to the original source, rather than patching up the translator.

Open Issues
===========

Is there a better way to support external text extraction tools, or at
least ``xgettext`` [#gettext]_ in particular?

References
==========

..  [#Orendorff] Implicit String Concatenation, Orendorff
    http://mail.python.org/pipermail/python-ideas/2007-April/000397.html

..  [#rcn-constantfold] Reminder: Py3k PEPs due by April, Hettinger,
    van Rossum
    http://mail.python.org/pipermail/python-3000/2007-April/006563.html

..  [#PEP3101] PEP 3101, Advanced String Formatting, Talin
    http://www.python.org/peps/pep-3101.html

..  [#elimpercent] ps to question Re: Need help completing ABC pep,
    van Rossum
    http://mail.python.org/pipermail/python-3000/2007-April/006737.html

..  [#skipSQL] (email Subject) PEP 30XZ: Simplified Parsing, Skip,
    http://mail.python.org/pipermail/python-3000/2007-May/007261.html

..  [#barryi8] (email Subject) PEP 30XZ: Simplified Parsing
    http://mail.python.org/pipermail/python-3000/2007-May/007305.html

..  [#gettext] GNU gettext manual
    http://www.gnu.org/software/gettext/

..  [#xgettext2048] Unix man page for xgettext -- Notes section
    http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+xgettext

Copyright
=========

    This document has been placed in the public domain.

..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    coding: utf-8
    End: