efficiently splitting up strings based on substrings

Sun Sep 6 03:48:19 EDT 2009

On Sep 6, 1:23 am, 7stud <bbxx789_0... at yahoo.com> wrote:
> On Sep 6, 1:14 am, 7stud <bbxx789_0... at yahoo.com> wrote:
>
>
>
> > On Sep 5, 5:29 pm, per <perfr... at gmail.com> wrote:
>
> > > On Sep 5, 7:07 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
>
> > > > On Sat, 05 Sep 2009 23:54:08 +0100, per <perfr... at gmail.com> wrote:
> > > > > On Sep 5, 6:42 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
> > > > >> On Sat, 05 Sep 2009 22:54:41 +0100, per <perfr... at gmail.com> wrote:
> > > > >> > I'm trying to efficiently "split" strings based on what substrings
> > > > >> > they are made up of.
> > > > >> > i have a set of strings that are comprised of known substrings.
> > > > >> > For example, a, b, and c are substrings that are not identical to each
> > > > >> > other, e.g.:
> > > > >> > a = "0" * 5
> > > > >> > b = "1" * 5
> > > > >> > c = "2" * 5
>
> > > > >> > Then my_string might be:
>
> > > > >> > my_string = a + b + c
>
> > > > >> > i am looking for an efficient way to solve the following problem.
> > > > >> > suppose i have a short
> > > > >> > string x that is a substring of my_string.  I want to "split" the
> > > > >> > string x into blocks based on
> > > > >> > what substrings (i.e. a, b, or c) chunks of s fall into.
>
> > > > >> > to illustrate this, suppose x = "00111". Then I can detect where x
> > > > >> > starts in my_string
> > > > >> > using my_string.find(x).  But I don't know how to partition x into
> > > > >> > blocks depending
> > > > >> > on the substrings.  What I want to get out in this case is: "00",
> > > > >> > "111".  If x were "001111122",
> > > > >> > I'd want to get out "00","11111", "22".
>
> > > > >> > is there an easy way to do this?  i can't simply split x on a, b, or c
> > > > >> > because these might
> > > > >> > not be contained in x.  I want to avoid doing something inefficient
> > > > >> > like looking at all substrings
> > > > >> > of my_string etc.
>
> > > > >> > i wouldn't mind using regular expressions for this but i cannot think
> > > > >> > of an easy regular
> > > > >> > expression for this problem.  I looked at the string module in the
> > > > >> > library but did not see
> > > > >> > anything that seemd related but i might have missed it.
>
> > > > >> I'm not sure I understand your question exactly.  You seem to imply
> > > > >> that the order of the substrings of x is consistent.  If that's the
> > > > >> case, this ought to help:
>
> > > > >> >>> import re
> > > > >> >>> x = "001111122"
> > > > >> >>> m = re.match(r"(0*)(1*)(2*)", x)
> > > > >> >>> m.groups()
>
> > > > >> ('00', '11111', '22')>>> y = "00111"
> > > > >> >>> m = re.match(r"(0*)(1*)(2*)", y)
> > > > >> >>> m.groups()
>
> > > > >> ('00', '111', '')
>
> > > > >> You'll have to filter out the empty groups for yourself, but that's
> > > > >> no great problem.
>
> > > > > The order of the substrings is consistent but what if it's not 0, 1, 2
> > > > > but a more complicated string? e.g.
>
> > > > > a = 1030405, b = 1babcf, c = fUUIUP
>
> > > > > then the substring x might be 4051ba, in which case using a regexp
> > > > > with (1*) will not work since both a and b substrings begin with the
> > > > > character 1.
>
> > > > Right.  This looks approximately nothing like what I thought your
> > > > problem was.  Would I be right in thinking that you want to match
> > > > substrings of your potential "substrings" against the string x?
>
> > > > I'm sufficiently confused that I think I'd like to see what your
> > > > use case actually is before I make more of a fool of myself.
>
> > > > --
> > > > Rhodri James *-* Wildebeest Herder to the Masses
>
> > > it's exactly the same problem, except there are no constraints on the
> > > strings.  so the problem is, like you say, matching the substrings
> > > against the string x. in other words, finding out where x "aligns" to
> > > the ordered substrings abc, and then determine what chunk of x belongs
> > > to a, what chunk belongs to b, and what chunk belongs to c.
>
> > > so in the example i gave above, the substrings are: a = 1030405, b =
> > > 1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP
>
> > > given a substring like 4051ba, i'd want to split it into the chunks a,
> > > b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
> > > i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
> > > that belongs to be. in this case, there are no chunks of c.  if x
> > > instead were "4051babcffUU", the right output is: ["405", "1babcf",
> > > "fUU"], which are the corresponding chunks of a, b, and c that make up
> > > x respectively.
>
> > > i'm not sure how to approach this. any ideas/tips would be greatly
> > > appreciated. thanks again.
>
> > a = "1030405"
> > b = "1babcf"
> > c = "fUUIUP"
> > abc = "10304051babcffUUIUP"
> > data = "4051babcffU"
>
> > data_start = abc.find(data)
> > b_start = abc.find(b) - data_start
> > c_start = abc.find(c) - data_start
>
> > print data[:b_start]
> > print data[b_start:c_start]
> > print data[c_start:]
>
> > --output:--
> > 405
> > 1babcf
> > fU
>
> ...or maybe this is easier to follow:
>
> a = "1030405"
> b = "1babcf"
> c = "fUUIUP"
> abc = "10304051babcffUUIUP"
> data = "4051babcffU"
>
> data_start = abc.find(data)
> new_abc = abc[data_start:]
> print new_abc
> print data
> print "-" * 10
>
> --output:--
> 4051babcffUUIUP
> 4051babcffU
> ----------
>
> b_start = new_abc.find(b)
> c_start = new_abc.find(c)
>
> print data[:b_start]
> print data[b_start:c_start]
> print data[c_start:]
>
> --output:--
> 405
> 1babcf
> fU

Nope.  My solutions have problems with:

data = "cffU"

To handle that data, it gets messier:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "cffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

b_start = new_abc.find(b)
if b_start == -1:
    b_start = 0

c_start = new_abc.find(c)
if c_start == -1:
    c_start = 0

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

If data is not a substring of abc, then that last line will select the
whole data string.