[Tutor] Testing a string to see if it contains a substring

Steven D'Aprano steve at pearwood.info
Thu Jan 22 01:15:28 CET 2015


On Wed, Jan 21, 2015 at 10:14:42AM -0800, dw wrote:
> Hello Python Friends.
> I have a string array, called "line_array".

Do you mean a list of strings? "String array" is not a standard Python 
term; it could mean something from the array module, from numpy, or 
something completely different.

It's often good to give simplified example code, rather than try to 
describe it in words, e.g.:

line_array = ["line 1", "line 2"]

> There may be up to 50 or more elements in the array.
> So:
> - line_array[1] may contain "01/04/2013  10:43 AM        17,410,217
> DEV-ALL-01-04-13.rlc\n"
> - line_array[2] may contain "01/25/2013  03:21 PM        17,431,230
> DEV-ALL-01-25-2013.rlc\n"
> - line_array[3] may contain "\n"

What happened to line_array[0] ?


> I want to retain all elements which are valid (i.e. contains a date
> value xx/xx/xxxx)
> So I'm using a regex search for the date value located at the start of
> each element...this way

Based on your description, I think the best way to do this is:

# remove blank lines
line_array = [line for line in line_array if line != '\n']


Possibly this is even nicer:

# get rid of unnecessary leading and trailing whitespace on each line
# and then remove blanks
line_array = [line.strip() for line in line_array]
line_array = [line for line in line_array if line]


This is an alternative, but perhaps a little cryptic for those not 
familiar with functional programming styles:

line_array = filter(None, map(str.strip, line_array))

No regexes required!
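
To see the effect concretely, here is a small sketch. The sample data is 
made up from your description, so treat it as an assumption, and the 
call to list() is only there so the functional version gives a real list 
on Python 3 as well:

```python
# Hypothetical sample data modelled on the lines described above.
line_array = [
    "01/04/2013  10:43 AM        17,410,217 DEV-ALL-01-04-13.rlc\n",
    "01/25/2013  03:21 PM        17,431,230 DEV-ALL-01-25-2013.rlc\n",
    "\n",
]

# Strip each line, then drop the ones that ended up empty.
stripped = [line.strip() for line in line_array]
cleaned = [line for line in stripped if line]

# The functional version; list() turns the result into a list on Python 3.
cleaned_functional = list(filter(None, map(str.strip, line_array)))

print(len(cleaned))                    # 2
print(cleaned == cleaned_functional)   # True
```

Both approaches leave just the two non-blank lines, with the trailing 
newlines gone.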

However, it isn't clear from your example whether non-blank lines 
*always* include a date. Suppose you have to filter date lines from 
non-date lines?

Start with a regex and a tiny helper function, which we can write inline 
with lambda and pass directly to filter:

import re

DATE = r'\d{2}/\d{2}/\d{4}'
line_array = filter(lambda line: re.search(DATE, line), line_array)

In Python version 3, you may need to wrap that in a call to list:

line_array = list(filter(lambda line: re.search(DATE, line), line_array))

but that isn't needed in Python 2.
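
The reason is that filter returns a lazy iterator in Python 3 rather 
than a list. Here is a minimal sketch of the difference, again using 
invented sample lines:

```python
import re

DATE = r'\d{2}/\d{2}/\d{4}'

# Hypothetical sample data: one dated line, one non-date line, one blank.
line_array = ["01/04/2013  10:43 AM\n", "Directory of C:\\backups\n", "\n"]

result = filter(lambda line: re.search(DATE, line), line_array)

# On Python 3, result is an iterator: no len(), no indexing.
# Wrapping it in list() gives something you can index and reuse.
as_list = list(result)
print(as_list)   # ['01/04/2013  10:43 AM\n']
```

If all you do is loop over the result once, you can skip the list() 
call on either version.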

If that's a bit cryptic, here it is again as a list comp:

DATE = r'\d{2}/\d{2}/\d{4}'
line_array = [line for line in line_array if re.search(DATE, line)]


Let's get rid of the whitespace at the same time!

line_array = [line.strip() for line in line_array if 
              re.search(DATE, line)]


And if that's still too cryptic ("what's a list comp?") here it is again 
expanded out in full:


temp = []
for line in line_array:
    if re.search(DATE, line):
        temp.append(line.strip())
line_array = temp


How does this work? It works because the two main re functions, 
re.match and re.search, return None when the regex isn't found, and a 
MatchObject when it is found. None has the property that it is 
considered "false" in a boolean context, while MatchObjects are always 
considered "true".
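
You can check this for yourself at the interactive interpreter. A quick 
sketch, where the two test strings are just examples I made up:

```python
import re

DATE = r'\d{2}/\d{2}/\d{4}'

hit = re.search(DATE, "01/04/2013  10:43 AM")
miss = re.search(DATE, "\n")

print(bool(hit))     # True
print(miss)          # None
print(bool(miss))    # False
print(hit.group())   # 01/04/2013
```

That is why "if re.search(DATE, line)" works as a test without comparing 
the result to anything.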

We don't care *where* the date is found in the string, only whether or 
not it is found, so there is no need to check the starting position.
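
If you did want to insist that the date sits at the very start of the 
line, as your description suggests, re.match is one way to do it, since 
it only matches at the beginning of the string. A small sketch with 
invented strings:

```python
import re

DATE = r'\d{2}/\d{2}/\d{4}'

at_start = "01/04/2013  10:43 AM  DEV-ALL-01-04-13.rlc"
in_middle = "backup taken 01/04/2013"

print(bool(re.search(DATE, in_middle)))   # True
print(bool(re.match(DATE, in_middle)))    # False
print(bool(re.match(DATE, at_start)))     # True
```

Note that re.match will not skip leading whitespace, so strip the line 
first if that matters.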



-- 
Steven

