[Tutor] Splitting on punctuation

eryksun eryksun at gmail.com
Mon Jun 10 12:12:18 CEST 2013


On Mon, Jun 10, 2013 at 4:27 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> >>> string.punctuation
> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
> >>> re.split("[" + string.punctuation + "]+", "yes, but no. But: yes, no")
> ['yes', ' but no', ' But', ' yes', ' no']

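Even though you didn't use re.escape(), that almost works; the only
character it misses is the backslash. ^ is special only as the first
character of a class, and the ] in string.punctuation happens to follow
a backslash, so it is escaped rather than closing the class early. Also,
because string.punctuation is sorted by character code, the range ,-.
is valid, and even correct: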

    >>> pat = re.compile('[,-.]', re.DEBUG)
    in
      range (44, 46)

    >>> map(ord, ',-.')
    [44, 45, 46]
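
Another way to convince yourself of the range, as a minimal sketch (the
names pat and matched are only for illustration): test every punctuation
character against the class and see which ones it accepts.

```python
import re
import string

# The class [,-.] is a range from ',' (44) to '.' (46),
# which also covers '-' (45).
pat = re.compile('[,-.]')

# Collect the punctuation characters that the class matches.
matched = [c for c in string.punctuation if pat.match(c)]
print(matched)  # [',', '-', '.']
```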

However, the otherwise harmless escape \] does consume the backslash,
so a literal backslash never becomes a member of the class. So remember
to use re.escape.
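
To check that backslash really is the only casualty, here is a rough
sketch that tries each punctuation character against both versions of
the class. The variable names are made up for this example.

```python
import re
import string

# Unescaped: the sequence \] inside the class is read as an escaped ']',
# so the backslash itself never becomes a member of the class.
unescaped = re.compile('[%s]' % string.punctuation)

# Escaped: re.escape keeps every punctuation character literal.
escaped = re.compile('[%s]' % re.escape(string.punctuation))

# List the punctuation characters each class fails to match.
missed_unescaped = [c for c in string.punctuation if not unescaped.match(c)]
missed_escaped = [c for c in string.punctuation if not escaped.match(c)]

print(missed_unescaped)  # ['\\']
print(missed_escaped)    # []
```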

Without re.escape:

    >>> pat1 = re.compile('[%s]+' % string.punctuation)
    >>> pat1.split(r'yes, but no... But: yes\no')
    ['yes', ' but no', ' But', ' yes\\no']

With re.escape:

    >>> pat2 = re.compile('[%s]+' % re.escape(string.punctuation))
    >>> pat2.split(r'yes, but no... But: yes\no')
    ['yes', ' but no', ' But', ' yes', 'no']
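
If you also want the pieces stripped of whitespace and without the empty
string that a trailing punctuation mark produces, something like the
following might do. It is only a sketch: PUNCT_RE and
split_on_punctuation are hypothetical names, not anything from the
standard library.

```python
import re
import string

# Build the class once; re.escape keeps '\', ']' and '^' literal.
PUNCT_RE = re.compile('[%s]+' % re.escape(string.punctuation))

def split_on_punctuation(text):
    """Split text on runs of ASCII punctuation, dropping empty pieces."""
    pieces = (piece.strip() for piece in PUNCT_RE.split(text))
    return [piece for piece in pieces if piece]

print(split_on_punctuation(r'yes, but no... But: yes\no.'))
# ['yes', 'but no', 'But', 'yes', 'no']
```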

