[Tutor] how to split/partition a string on keywords?

Fri Aug 24 21:33:00 CEST 2012

On Fri, Aug 24, 2012 at 1:30 PM, Jared Nielsen <nielsen.jared at gmail.com> wrote:
>
> But if I run the following:
>
>
> #!/usr/bin/python
>
> text = raw_input("Enter text: ")
>
> text.replace("and", "\nand")
> text.replace("or", "\nor")
>
> print text
>
> I get the text as it was entered.
> Is there a way to replace text in a string without splitting or
> partitioning?

A Python string is immutable, so replace() doesn't change "text"; it
returns a new string. To keep the result, assign it to a variable
(reusing the same name is fine).
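
For instance, here is a minimal sketch of that fix. The sample sentence
is made up, and the code is written so it should run under both
Python 2 and 3. It also shows why plain replace() is too blunt for
this job: it breaks inside words such as "for".

```python
text = "ham and cheese for lunch"

# replace() returns a new string; calling it alone leaves text untouched.
text.replace("and", "\nand")
assert text == "ham and cheese for lunch"

# Binding the return value to a name keeps the result.
broken = text.replace("and", "\nand")
broken = broken.replace("or", "\nor")

# repr() makes the inserted newlines visible, including the unwanted
# one inside "for", because replace() knows nothing about word edges.
print(repr(broken))
```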

> The bigger picture for this little project is a "poetry machine", in which a
> user enters some prose and the program chops it up into modern poetry.

Based on your original problem description, I think you'll get the
best result using regular expression substitution:

    import re

    words = "|".join(["and", "or"])

    pattern = r"(^|\W)({kwds})(?=\W|$)".format(kwds=words)

    prose = []
    line = raw_input("Enter text: ")
    while line:
        prose.append(line)
        line = raw_input()

    poetry = " ".join(re.sub(pattern, r"\1\n\2", line) for line in prose)
    print poetry

    # try
    # or... ham and cheese or chicken and waffles...and...
    # whatever she wants--and he'll have the special.

Output:

    or... ham
    and cheese
    or chicken
    and waffles...
    and... whatever she wants--
    and he'll have the special.

The first thing to note with regular expressions is to always use
(r)aw string literals, such as r"text". The expressions make heavy use
of backslash escapes, so you don't want Python's compiler to process
them as regular string escapes first. It's much simpler to use raw
mode than to double every backslash as '\\'.
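
A small sketch of the difference, using \b (word boundary) as the
example. Note that \b is not part of the pattern above; I picked it
only because it collides with a regular string escape (backspace), so
the effect is easy to see. The sample line is made up.

```python
import re

# In a regular literal, "\b" is one character: a backspace (0x08).
assert len("\b") == 1
# In a raw literal the backslash survives, so re receives the two
# characters \ and b, which it reads as a word boundary.
assert len(r"\b") == 2

line = "ham and eggs"
raw_match = re.search(r"\band\b", line)     # word-boundary pattern
cooked_match = re.search("\band\b", line)   # backspace, "and", backspace

print(raw_match is not None)   # the raw pattern finds "and"
print(cooked_match is None)    # the non-raw pattern finds nothing
```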

In "pattern" you see three sub-patterns in parentheses. The first two
are capturing groups; the third is a lookahead, described below. In
group one, the ^ matches at the beginning of the string (each line is
passed to re.sub separately, so that is the beginning of the line),
the \W matches any character that is not a letter, digit, or
underscore, and the | means the group will match on either ^ or \W.
In group two, I'm using Python's string formatting to map the string
"and|or" into {kwds}. As before, the | means this group matches on
either the literal "and" or the literal "or".
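
To see what the two capturing groups pick up, here is a short sketch
that runs the same pattern through re.search. The sample strings are
made up for illustration.

```python
import re

words = "|".join(["and", "or"])
pattern = r"(^|\W)({kwds})(?=\W|$)".format(kwds=words)

# Keyword in the middle of a line: group 1 is the preceding space.
mid = re.search(pattern, "ham and cheese")
print(repr(mid.group(1)))   # the single space before the keyword
print(repr(mid.group(2)))   # the keyword itself

# Keyword at the start: group 1 matched ^, which is an empty string.
start = re.search(pattern, "or else")
print(repr(start.group(1)))
print(repr(start.group(2)))

# "and" and "or" buried inside longer words are not matched.
inside = re.search(pattern, "android forever")
print(inside)
```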

The third sub-pattern is a bit more complicated. It starts with ?=,
which makes it a lookahead assertion. When it matches, it won't
consume any of the string. This allows overlapping matches on the
character matched by \W. In other words, if you have "some text
and and repeated", the whitespace joining the first " and " and the
second " and " can count toward both matches. The $ means the
lookahead also succeeds at the end of the line.
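
A sketch of why the lookahead matters, using the "and and" example.
The "consuming" pattern is a hypothetical variant I wrote only for
comparison: it captures the trailing character instead of peeking at
it, so the shared space is used up by the first match.

```python
import re

line = "some text and and repeated"

lookahead = r"(^|\W)(and|or)(?=\W|$)"   # the pattern from above
consuming = r"(^|\W)(and|or)(\W|$)"     # hypothetical variant, no lookahead

with_lookahead = re.sub(lookahead, r"\1\n\2", line)
with_consuming = re.sub(consuming, r"\1\n\2\3", line)

# With the lookahead, both keywords get a line break.
print(repr(with_lookahead))
# Without it, the second "and" is skipped: the space it needs in
# front was already consumed by the first match.
print(repr(with_consuming))
```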

Finally, all of the prose lines are processed with re.sub. This looks
for every match of "pattern" in "line" and replaces it with
r"\1\n\2". In the replacement string \1 is group 1, \2 is group 2,
and \n is a newline character.
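
The substitution step can be wrapped in a small function, sketched
below. The name break_before is made up, and the re.escape call is my
addition: it is unnecessary for plain words like "and" and "or", but
guards against keywords that contain regex metacharacters.

```python
import re

def break_before(line, keywords):
    """Return line with a newline inserted before each whole-word keyword."""
    # Escape each keyword so it is matched literally (an assumption
    # beyond the original code, which joins the words unescaped).
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = r"(^|\W)({kwds})(?=\W|$)".format(kwds=alternatives)
    # \1 puts back whatever preceded the keyword, then the newline,
    # then the keyword itself (\2).
    return re.sub(pattern, r"\1\n\2", line)

keywords = ["and", "or"]
print(repr(break_before("ham and cheese or chicken", keywords)))
print(repr(break_before("waiting for android", keywords)))
print(repr(break_before("or... ham", keywords)))
```

Note that a keyword at the very start of a line also gets a newline in
front of it, so the result can begin with an empty line.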

Please see the re docs for further explanation:

http://docs.python.org/library/re

Here's a pretty good tutorial in general:

http://www.regular-expressions.info

> So, this is a long shot naive noob question, but is there any way to count
> syllables in words in a string? Or at least approximate this procedure?

At this point you're getting into natural language processing. You can
try the Natural Language Toolkit (NLTK), but I don't know
if it breaks words into syllables:

http://nltk.org
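
If a rough approximation is enough, you could count runs of
consecutive vowels in each word, which needs no library. This is only
a sketch under loud assumptions: the function names are made up, "y"
is treated as a vowel, and silent letters are miscounted (it gives
"cheese" 2 syllables), so treat the numbers as estimates.

```python
import re

def rough_syllables(word):
    """Estimate syllables as the number of vowel runs, at least 1."""
    runs = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(runs))

def rough_line_syllables(line):
    """Sum the per-word estimates for every word in line."""
    # A "word" here is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", line)
    return sum(rough_syllables(w) for w in words)

print(rough_syllables("chicken"))
print(rough_syllables("waffles"))
print(rough_line_syllables("ham and cheese"))
```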

Good luck.