[Tutor] Splitting on punctuation

Albert-Jan Roskam fomcl at yahoo.com
Tue Jun 11 11:51:33 CEST 2013


----- Original Message -----
> From: eryksun <eryksun at gmail.com>
> To: Albert-Jan Roskam <fomcl at yahoo.com>
> Cc: Alan Gauld <alan.gauld at btinternet.com>; "tutor at python.org" <tutor at python.org>
> Sent: Monday, June 10, 2013 12:12 PM
> Subject: Re: [Tutor] Splitting on punctuation
> 
> On Mon, Jun 10, 2013 at 4:27 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>> 
>> >>> string.punctuation
>> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>> >>> re.split("[" + string.punctuation + "]+", "yes, but no. But: yes, no")
>> ['yes', ' but no', ' But', ' yes', ' no']
> 
> Even though you didn't use re.escape(), that almost works, except for
> backslash. Since the string doesn't start with ^ or end with ],
> neither is treated specially. Also, because string.punctuation is
> sorted, the range ,-. is valid, and even correct:

Thank you. I should have been aware of all those meta-characters. I did not know about re.escape, but it's handy indeed.
 
 
>     >>> pat = re.compile('[,-.]', re.DEBUG)
>     in
>       range (44, 46)
> 
>     >>> map(ord, ',-.')
>     [44, 45, 46]

I tried using re.DEBUG, but I can't really make sense of the output. When I am really desperate, I sometimes use redemo.py in the scripts directory (or I'll email you guys --> I'll do that after this mail ;-)).
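
One alternative to reading the re.DEBUG output is to test a character class against every candidate character and list the ones it accepts. This is only a minimal sketch, assuming Python 3.4 or later (for fullmatch); matched_chars is a hypothetical helper name, not something from the re module.

```python
import re
import string

def matched_chars(char_class, candidates=string.printable):
    """Return the characters from `candidates` that `char_class` matches."""
    pat = re.compile(char_class)
    # fullmatch on a single character tells us whether the class accepts it
    return ''.join(ch for ch in candidates if pat.fullmatch(ch))

# The range ,-. covers code points 44 through 46.
print(matched_chars('[,-.]'))

# Which punctuation does the unescaped class fail to match?
unescaped = '[%s]' % string.punctuation
missing = set(string.punctuation) - set(matched_chars(unescaped, string.punctuation))
print(missing)
```

The second check reports only the backslash as missing, which agrees with the remark above that the unescaped pattern "almost works, except for backslash".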
 
 
 
> However, the otherwise harmless escape \] does consume the backslash.
> So remember to use re.escape.
> 
> Without re.escape:
> 
>     >>> pat1 = re.compile('[%s]+' % string.punctuation)
>     >>> pat1.split(r'yes, but no... But: yes\no')
>     ['yes', ' but no', ' But', ' yes\\no']
> 
> With re.escape:
> 
>     >>> pat2 = re.compile('[%s]+' % re.escape(string.punctuation))
>     >>> pat2.split(r'yes, but no... But: yes\no')
>     ['yes', ' but no', ' But', ' yes', 'no']
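
Putting the re.escape advice together, a small helper might look like the sketch below. It assumes Python 3, and split_on_punctuation is a hypothetical name chosen for illustration. Stripping whitespace and dropping empty pieces is an extra step that was not part of the examples above.

```python
import re
import string

# Escape the punctuation so that backslash, ], ^ and - are all taken literally.
PUNCT_RUN = re.compile('[%s]+' % re.escape(string.punctuation))

def split_on_punctuation(text):
    """Split `text` on runs of ASCII punctuation, dropping empty pieces."""
    parts = PUNCT_RUN.split(text)
    # strip() removes the leading space left after ', ' and ': '
    return [p.strip() for p in parts if p.strip()]

print(split_on_punctuation(r'yes, but no... But: yes\no'))
```

Compiling the pattern once at module level avoids rebuilding it on every call. Non-ASCII punctuation is not covered by string.punctuation, so it would need separate handling.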

