Text parsing via regex

Robocop bthayre at physics.ucsd.edu
Mon Dec 8 19:13:20 CET 2008


I'm having a little text parsing problem that I think would be really
quick to troubleshoot for someone more versed in Python and regexes.
I need to write a simple script that splits an arbitrarily long
string into 50-character pieces without breaking in the middle of
words (ultimately every piece should be exactly 50 characters, so
padding with whitespace is necessary).  So I immediately came up with
something along the lines of:

import re

string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")
r = re.compile(r".{50}")
m = r.match(string)  # matches only the first 50 characters

Then I started to realize that I didn't know how to do exactly what I
wanted.  At this point I wanted to find a way to simply use something
like:

parsed_1, parsed_2, ..., parsed_n = m.groups()

However, I'm having several problems.  I know that the playskool
regular expression I wrote above only counts off 50 characters at a
time, and will blindly cut words in half if a piece doesn't end on
whitespace.  I'm relatively new to regexes and I don't know how to
have it take that into account, or even what type of logic I would
need to fill in the extra whitespace to make each piece the proper
length when avoiding cutting words up.  So that's problem #1.

Problem #2 is that because the string is of arbitrary length, I never
know how many pieces I'll have, and thus do not immediately know how
many variables need to be created to hold them.  It's easy enough
on each pass of the function to find how many I will have by doing:
mag = len(string)
upper_lim = (mag + 49) // 50  # ceiling of mag / 50
But I'm not sure how to declare the variables and set them to my
parsed strings.  Now problem #1 isn't as pressing; I can technically
get away with cutting up the words, I'd just prefer not to.  The most
pressing problem right now is #2.  Any help or suggestions would be
great; anything to get me thinking differently is helpful.
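
To make the two problems concrete, here is a rough sketch of one possible
direction, not a definitive answer.  It assumes that a plain list is an
acceptable stand-in for the numbered variables, and that the standard
library's textwrap module is acceptable for the word-aware splitting.
The function names chunk_blind and chunk_on_words are made up for
illustration.

```python
import re
import textwrap


def chunk_blind(text, width=50):
    """Cut text every `width` characters, ignoring word boundaries."""
    # findall returns a list, so the number of pieces does not need
    # to be known in advance; the last piece may be shorter than width.
    return re.findall(r".{1,%d}" % width, text, flags=re.DOTALL)


def chunk_on_words(text, width=50):
    """Break text at whitespace, then pad each piece to `width` characters."""
    # textwrap.wrap returns lines of at most `width` characters;
    # ljust fills each one out with trailing spaces.
    return [line.ljust(width) for line in textwrap.wrap(text, width=width)]


string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")

pieces = chunk_on_words(string)
for piece in pieces:
    print(repr(piece))
```

Looping over the list (or indexing it as pieces[0], pieces[1], and so on)
takes the place of parsed_1 through parsed_n, so the count never has to
be computed up front.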


