[Tutor] Splitting by word boundaries
Michael Janssen
Janssen at rz.uni-frankfurt.de
Fri Aug 15 00:03:28 EDT 2003
On Thu, 14 Aug 2003, Jonathan Hayward http://JonathansCorner.com wrote:
> Neil Schemenauer wrote:
> >re.findall(r'\w+', ...) should do what is intended.
> I couldn't figure out how to get this to accept re.DOTALL or equivalent
> and wrote a little bit of HTML handling that the original regexp
> wouldn't have done:
[You needn't set re.DOTALL when your pattern doesn't contain a dot; the flag only changes what '.' matches.]
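A small check of that remark, as a sketch only: the sample string and the variable names are made up for illustration. It compares the result of the r'\w+' pattern with and without the flag.

```python
import re

# Made-up sample text containing a newline and some markup
text = "ad=\n2 <b>x</b>"

# r'\w+' contains no '.', so re.DOTALL has nothing to act on
plain = re.findall(r'\w+', text)
dotall = re.findall(r'\w+', text, re.DOTALL)

print(plain)            # ['ad', '2', 'b', 'x', 'b']
print(plain == dotall)  # True
```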
Running your code (after getting it into a runnable state ;-) doesn't
split the input :-( To my eyes it is missing a while loop. In my approach I
rely on r'\b' to look ahead to the next boundary and slice the string there:
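import re

def boundary_split(s):
    back = []
    while 1:
        try:
            # r'.\b' and +1 prevent an endless loop: each match
            # takes at least one character before the boundary
            pos = re.search(r'.\b', s, re.DOTALL).start()+1
        except AttributeError:
            # search() returned None: no boundary left, keep the rest
            if s: back.append(s)
            break
        back.append(s[:pos])
        s = s[pos:]
    return back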
>>> boundary_split(" ad=\n2 ")
[' ', 'ad', '=\n', '2', ' ']
>>> boundary_split("<title>word boundary</title>")
['<', 'title', '>', 'word', ' ', 'boundary', '</', 'title', '>']
Is this what you want to get?
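For comparison, a shorter sketch that should produce the same pieces for the two inputs above by collecting alternating runs of word and non-word characters in one findall call. The name split_runs is hypothetical, and this is a different technique from the loop over r'\b', not a drop-in from the thread.

```python
import re

def split_runs(s):
    # Hypothetical helper: each item is a maximal run of word
    # characters (\w+) or of non-word characters (\W+).
    # \W also matches newlines, so no re.DOTALL is needed.
    return re.findall(r'\w+|\W+', s)

print(split_runs(" ad=\n2 "))
# [' ', 'ad', '=\n', '2', ' ']
print(split_runs("<title>word boundary</title>"))
# ['<', 'title', '>', 'word', ' ', 'boundary', '</', 'title', '>']
```

Because every character is either \w or \W, nothing from the input is dropped, and an empty string simply gives an empty list.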
Michael