efficiently splitting up strings based on substrings

Sat Sep 5 18:54:08 EDT 2009

On Sep 5, 6:42 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
> On Sat, 05 Sep 2009 22:54:41 +0100, per <perfr... at gmail.com> wrote:
> > I'm trying to efficiently "split" strings based on what substrings
> > they are made up of.
> > i have a set of strings that are comprised of known substrings.
> > For example, a, b, and c are substrings that are not identical to each
> > other, e.g.:
> > a = "0" * 5
> > b = "1" * 5
> > c = "2" * 5
>
> > Then my_string might be:
>
> > my_string = a + b + c
>
> > i am looking for an efficient way to solve the following problem.
> > suppose i have a short
> > string x that is a substring of my_string.  I want to "split" the
> > string x into blocks based on
> > what substrings (i.e. a, b, or c) chunks of s fall into.
>
> > to illustrate this, suppose x = "00111". Then I can detect where x
> > starts in my_string
> > using my_string.find(x).  But I don't know how to partition x into
> > blocks depending
> > on the substrings.  What I want to get out in this case is: "00",
> > "111".  If x were "001111122",
> > I'd want to get out "00","11111", "22".
>
> > is there an easy way to do this?  i can't simply split x on a, b, or c
> > because these might
> > not be contained in x.  I want to avoid doing something inefficient
> > like looking at all substrings
> > of my_string etc.
>
> > i wouldn't mind using regular expressions for this but i cannot think
> > of an easy regular
> > expression for this problem.  I looked at the string module in the
> > library but did not see
> > anything that seemd related but i might have missed it.
>
> I'm not sure I understand your question exactly.  You seem to imply
> that the order of the substrings of x is consistent.  If that's the
> case, this ought to help:
>
> >>> import re
> >>> x = "001111122"
> >>> m = re.match(r"(0*)(1*)(2*)", x)
> >>> m.groups()
>
> ('00', '11111', '22')>>> y = "00111"
> >>> m = re.match(r"(0*)(1*)(2*)", y)
> >>> m.groups()
>
> ('00', '111', '')
>
> You'll have to filter out the empty groups for yourself, but that's
> no great problem.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.

a = 1030405, b = 1babcf, c = fUUIUP

then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.

your solution works if that weren't a possibility, so what you wrote
is definitely the kind of solution i am looking for. i am just not
sure how to solve it in the general case where the substrings might be
similar to each other (but not similar enough that you can't tell
where the substring came from).