[Tutor] How to substitute an element of a list as a pattern for re.compile()

Thu Dec 30 07:06:10 CET 2004

kumar s wrote:
> I have Question: 
> How can I substitute an object as a pattern in making
> a pattern. 
> 
>>>> x = 30
>>>> pattern = re.compile(x)
> 

Kumar,

You can use string interpolation to insert x into a string, which can 
then be compiled into a pattern:

x = 30
pat = re.compile('%s'%x)
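
As a minimal sketch of the same idea (the sample strings and the names 
coord, coord_pat and found are made up for illustration): str(x) does the 
same job as '%s' % x, and re.escape() is worth considering when the value 
might contain characters that have a special meaning in a regular 
expression.

```python
import re

x = 30
# re.compile() needs a string (or bytes) pattern, so convert the number first.
pat = re.compile(str(x))            # same pattern as re.compile('%s' % x)

# For values that may contain regex metacharacters, re.escape() keeps
# every character literal.
coord = '334:3'                     # hypothetical search key
coord_pat = re.compile(re.escape(coord))

# search() returns a match object, or None if the pattern is not found.
found = pat.search('position 30 of 100') is not None
```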

I really doubt regular expressions will speed up your current searching 
algorithm. You probably need to reconsider the data structures you are 
using to represent your data.

> I have a list of numbers that I have to match in
> another list and write them to a new file:
> 
> List 1: range_cors 
>>>> range_cors[1:5]
> ['161:378', '334:3', '334:4', '65:436']
> 
> List 2: seq
>>>> seq[0:2]
> ['>probe:HG-U133A_2:1007_s_at:416:177;
> Interrogation_Position=3330; Antisense;',
> 'CACCCAGCTGGTCCTGTGGATGGGA']
> 
> 

Can you re-process your second list? One option might be to store that 
list instead as a dict, where the keys are what you want to search by 
(maybe a string like '12:34' or a tuple like (12,34)).

Maybe something like the following:

 >>> range_cors = ['12:34','34:56']
 >>> seq = {'12:34': ['some 12:34 data'],
...        '34:56': ['some 34:56 data','more 34:56 data']}
 >>> for item in range_cors:
... 	print seq[item]
... 	
['some 12:34 data']
['some 34:56 data', 'more 34:56 data']

Why is this better?

If you have m lines of data and n patterns to search for, then with 
either of your methods you perform n searches per line, totalling 
approx. m*n operations. That cost is the same whether you use the 
string searching version or the re searching version.

If you pre-process the data so that it can be stored in and retrieved 
from a dict, building that dict costs you roughly m operations, and each 
of your n lookups is then a single hash lookup (roughly constant time), 
so you only have to complete approx. m+n operations.
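
Here is a rough sketch of that pre-processing step, under some loud 
assumptions: that seq strictly alternates header line / sequence line, and 
that the key you want is the last two colon-separated fields before the 
first ';' in the header (e.g. '416:177'). The function name build_index, 
the second record and the search keys are invented for illustration; I 
have not seen your real data.

```python
def build_index(seq):
    """Map an 'x:y' key to the list of (header, sequence) pairs that carry it."""
    index = {}
    # Assumption: entries alternate header, sequence, header, sequence, ...
    for i in range(0, len(seq) - 1, 2):
        header, sequence = seq[i], seq[i + 1]
        probe_id = header.split(';')[0]       # text before the first ';'
        fields = probe_id.split(':')
        key = ':'.join(fields[-2:])           # assumed: last two fields are x and y
        index.setdefault(key, []).append((header, sequence))
    return index


seq = [
    '>probe:HG-U133A_2:1007_s_at:416:177; Interrogation_Position=3330; Antisense;',
    'CACCCAGCTGGTCCTGTGGATGGGA',
    # The record below is made up for the example.
    '>probe:HG-U133A_2:1007_s_at:334:3; Interrogation_Position=3400; Antisense;',
    'ACGTACGTACGTACGTACGTACGTA',
]
range_cors = ['416:177', '334:3', '65:436']   # hypothetical search keys

index = build_index(seq)                      # one pass over the m lines

sequences = []
for key in range_cors:                        # n dict lookups
    for header, sequence in index.get(key, []):
        sequences.append(header)
        sequences.append(sequence)
```

A side benefit of exact keys: a substring test such as '334:3' in line 
would also match a header containing '334:30', while the dict lookup only 
matches the exact key.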

> A slow method:
>>>> sequences = []
>>>> for elem1 in range_cors:
> 	for index,elem2 in enumerate(seq):
> 		if elem1 in elem2:
> 			sequences.append(elem2)
> 			sequences.append(seq[index+1])
> 
> A faster method (probably):
> 
>>>> for i in range(len(range_cors)):
> 	for index,m in enumerate(seq):
> 		pat = re.compile(i)
> 		if re.search(pat,seq[m]):
> 			p.append(seq[m])
> 			p.append(seq[index+1])
> 

> I am getting errors, because I am trying to create an
> element as a pattern in re.compile(). 
> 

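pat = re.compile('%s'%i) would probably get rid of the error message, 
but that's probably still not what you want: i is an integer index from 
range(len(range_cors)), so the pattern would be '0', '1', ... rather 
than the element range_cors[i]. Likewise, m is already an element of 
seq, so seq[m] will fail; you want m itself.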
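
If you do want to see the re version working, a hedged sketch might look 
like the following. The sample data is invented, and the startswith('>') 
test and the bounds check are my additions, assuming that header lines 
begin with '>' and that each one is followed by its sequence line. It 
still does m*n searches, so it is not a speed-up.

```python
import re

range_cors = ['416:177', '65:436']            # made-up search keys
seq = [
    '>probe:HG-U133A_2:1007_s_at:416:177; Interrogation_Position=3330; Antisense;',
    'CACCCAGCTGGTCCTGTGGATGGGA',
]

p = []
for cor in range_cors:                        # loop over the strings, not range(len(...))
    pat = re.compile(re.escape(cor))          # compile once per key, outside the inner loop
    for index, line in enumerate(seq):
        # Only test header lines, and make sure a following line exists.
        if line.startswith('>') and index + 1 < len(seq) and pat.search(line):
            p.append(line)
            p.append(seq[index + 1])          # the sequence line follows its header
```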

> 
> Questions:
> 
> 1. Is it possible to do this. If so, how can I do this.  

You can try, but I doubt regular expressions will help; that approach 
will probably be even slower.

> Can any one help correcting my piece of code and
> suggesting where I went wrong. 

I would scrap what you have and try using a better data structure. I 
don't know enough about your data to make more specific processing 
recommendations; but you can probably avoid those nested loops with some 
careful data pre-processing.

You'll likely get better suggestions if you post a more representative 
sample of your data, and explain exactly what you want as output.

Good luck.

Rich