Text parsing via regex
ivlenin at gmail.com
Mon Dec 8 20:22:14 CET 2008
On Mon, 08 Dec 2008 13:42:00 -0500, r0g wrote:
> Robocop wrote:
>> However i'm having several problems. I know that playskool regular
>> expression i wrote above will only parse every 50 characters, and will
>> blindly cut words in half if the parsed string doesn't end with a
>> whitespace. I'm relatively new to regexes and i don't know how to have
>> it take that into account, or even what type of logic i would need to
>> fill in the extra whitespaces to make the string the proper length when
>> avoiding cutting words up. So that's problem #1.
Regexps may not be the solution here. You could consider the textwrap
module ( http://docs.python.org/library/textwrap.html ), although that
will only split your text into strings up to 50 characters long, rather
than padding with whitespace to exactly 50 characters.
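If textwrap's splitting is close enough, padding can be added on top of it. The following is only a sketch: the function name wrap_and_pad and the sample text are made up for illustration, and it assumes that padding on the right with spaces is what you want.

```python
import textwrap

def wrap_and_pad(text, width=50):
    """Split text on word boundaries, then pad each line to width.

    wrap_and_pad is a hypothetical helper name, not part of textwrap.
    """
    # textwrap.wrap returns a list of lines, each at most `width` long.
    lines = textwrap.wrap(text, width=width)
    # str.ljust pads on the right with spaces up to `width` characters.
    return [line.ljust(width) for line in lines]

sample = ("The quick brown fox jumps over the lazy dog and keeps "
          "running through the forest until nightfall.")
chunks = wrap_and_pad(sample)
for chunk in chunks:
    print(repr(chunk))
```

By default textwrap.wrap also breaks any single word longer than the width, so every chunk comes out at exactly 50 characters.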
If you really need the strings to be exactly 50 characters long (and are
you sure you do?), try:
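    # Split the input up into separate words
    words = input_string.split()
    groups = []
    current_string = ''
    current_length = 0
    for word in words:
        if current_length == 0:
            # First word of a new string: no leading space.
            current_string = word
            current_length = len(word)
        elif current_length + len(word) + 1 <= 50:
            # If adding a space followed by the current word
            # wouldn't take us over 50 chars, add the word.
            current_string += ' ' + word
            current_length += len(word) + 1
        else:
            # Pad the string with spaces, and add it to our
            # list of strings, then start a new string.
            # (A single word longer than 50 chars is kept whole.)
            current_string += ' ' * (50 - current_length)
            groups.append(current_string)
            current_string = word
            current_length = len(word)
    # Pad and add whatever is left over after the last word.
    if current_string:
        groups.append(current_string + ' ' * (50 - current_length))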
>> Problem #2 is that
>> because the string is of arbitrary length, i never know how many parsed
>> strings i'll have, and thus do not immediately know how many variables
>> need to be created to accompany them. It's easy enough with each pass
>> of the function to find how many i will have by doing: mag =
>> upper_lim = mag/50 + 1
>> But i'm not sure how to declare and set them to my parsed strings.
Whenever you find yourself thinking "I don't know how many variables I
need," the answer is almost always that you need one variable, which is a
list. In the code above, the 50-char-long strings will all get put in the
list called "groups".
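To illustrate the point about using one list rather than many variables, here is a small sketch. The three sample strings are invented for the example; in practice "groups" would hold whatever your splitting code produced.

```python
# Stand-in data: three made-up chunks, each padded to 50 characters.
groups = [text.ljust(50) for text in
          ["first chunk", "second chunk", "third chunk"]]

# The list knows how many strings it holds, so there is no need
# to work out the count in advance.
print(len(groups))

# Individual strings are reached by index rather than by name...
first = groups[0]

# ...and a loop handles all of them, however many there are.
for number, chunk in enumerate(groups, 1):
    print(number, repr(chunk))
```

The list grows as needed with groups.append(...), so the same code works whether the input yields two strings or two hundred.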