[Tutor] Splitting on punctuation
Albert-Jan Roskam
fomcl at yahoo.com
Tue Jun 11 11:51:33 CEST 2013
----- Original Message -----
> From: eryksun <eryksun at gmail.com>
> To: Albert-Jan Roskam <fomcl at yahoo.com>
> Cc: Alan Gauld <alan.gauld at btinternet.com>; "tutor at python.org" <tutor at python.org>
> Sent: Monday, June 10, 2013 12:12 PM
> Subject: Re: [Tutor] Splitting on punctuation
>
> On Mon, Jun 10, 2013 at 4:27 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>>
>> >>> string.punctuation
>> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>> >>> re.split("[" + string.punctuation + "]+", "yes, but no. But: yes, no")
>> ['yes', ' but no', ' But', ' yes', ' no']
>
> Even though you didn't use re.escape(), that almost works, except for
> backslash. Since the string doesn't start with ^ or end with ],
> neither is treated specially. Also, because string.punctuation is
> sorted, the range ,-. is valid, and even correct:
Thank you. I should have been aware of all those meta-characters. I did not know about re.escape but it's handy indeed.
> >>> pat = re.compile('[,-.]', re.DEBUG)
> in
>   range (44, 46)
>
> >>> map(ord, ',-.')
> [44, 45, 46]
I tried using re.DEBUG but I can't really make sense of the output. If I am really desperate I sometimes use redemo.py in the scripts directory (or I'll email you guys --> I'll do that after this mail ;-)
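If the re.DEBUG dump is hard to read, one alternative is to ask the compiled class directly which characters it matches. This is only a minimal sketch, assuming Python 3 (the session quoted above is Python 2) and checking ASCII only; the name "matched" is just illustrative:

```python
import re

# Brute-force check: which ASCII characters does the class [,-.] match?
pat = re.compile('[,-.]')
matched = [chr(i) for i in range(128) if pat.fullmatch(chr(i))]

print(matched)                    # [',', '-', '.']
print([ord(c) for c in matched])  # [44, 45, 46]
```

The hyphen shows up because ',-.' is read as the range 44..46, which happens to contain '-' (45).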
> However, the otherwise harmless escape \] does consume the backslash.
> So remember to use re.escape.
>
> Without re.escape:
>
> >>> pat1 = re.compile('[%s]+' % string.punctuation)
> >>> pat1.split(r'yes, but no... But: yes\no')
> ['yes', ' but no', ' But', ' yes\\no']
>
> With re.escape:
>
> >>> pat2 = re.compile('[%s]+' % re.escape(string.punctuation))
> >>> pat2.split(r'yes, but no... But: yes\no')
> ['yes', ' but no', ' But', ' yes', 'no']
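For the record, the comparison above can be put into one small script. This is a sketch, assuming Python 3; the names "unescaped", "escaped" and "missing" are illustrative and not from the original session:

```python
import re
import string

text = r'yes, but no... But: yes\no'

# Unescaped: inside the class, \] is read as an escaped ']', so the
# backslash itself never becomes a member of the set.
unescaped = re.compile('[%s]+' % string.punctuation)

# Escaped: re.escape() backslash-escapes the metacharacters, so the
# backslash in string.punctuation is matched literally.
escaped = re.compile('[%s]+' % re.escape(string.punctuation))

print(unescaped.split(text))  # ['yes', ' but no', ' But', ' yes\\no']
print(escaped.split(text))    # ['yes', ' but no', ' But', ' yes', 'no']

# Which punctuation characters does the unescaped class fail to match?
missing = [c for c in string.punctuation if not unescaped.fullmatch(c)]
print(missing)                # ['\\']
```

The last check shows that the backslash is the only character lost when re.escape is left out, which fits the "almost works" remark above.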
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~