[Tutor] regex problem

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Jan 5 06:15:46 CET 2005



On Tue, 4 Jan 2005, Michael Powe wrote:

> def parseFile(inFile) :
>     import re
>     bSpace = re.compile("^ ")
>     multiSpace = re.compile(r"\s\s+")
>     nbsp = re.compile(r"&nbsp;")
>     HTMLRegEx = re.compile(
>         r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)",
>         re.I)
>
>     f = open(inFile,"r")
>     lines = f.readlines()
>     newLines = []
>     for line in lines :
>         line = HTMLRegEx.sub(' ',line)
>         line = bSpace.sub('',line)
>         line = nbsp.sub(' ',line)
>         line = multiSpace.sub(' ',line)
>         newLines.append(line)
>     f.close()
>     return newLines
>
> Now, the main issue I'm looking at is with the multiSpace regex.  When
> applied, this removes some blank lines but not others.  I don't want it
> to remove any blank lines, just contiguous multiple spaces in a line.


Hi Michael,

Do you have an example of a file where this bug takes place?  As far as I
can tell, since the processing is being done line-by-line, the program
shouldn't be losing any blank lines at all.

Do you mean that the 'multiSpace' pattern is eating the line-terminating
newlines?  If you don't want it to do this, you can modify the pattern
slightly.  '\s' is defined to be this group of characters:

    '[ \t\n\r\f\v]'

(from http://www.python.org/doc/lib/re-syntax.html)

So we can adjust our pattern from:

    r"\s\s+"

to

    r"[ \t\f\v][ \t\f\v]+"

so that we don't capture newlines or carriage returns.  Regular
expressions also have a brace operator for dealing with repetition: if
we're looking for two or more of some thing 'x', we can say:

    x{2,}

which lets us shorten the adjusted pattern to:

    r"[ \t\f\v]{2,}"
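
To make the difference concrete, here is a small sketch that runs both
patterns over a few made-up lines.  The sample strings and the variable
names are my own assumptions, not taken from your file.  It hints at one
way a "blank" line could disappear: if the line actually holds a couple
of spaces, '\s\s+' matches the spaces together with the newline.

```python
import re

# Original pattern: \s also matches "\n" and "\r".
multi_space_old = re.compile(r"\s\s+")
# Adjusted pattern: two or more of space, tab, form feed, vertical tab only.
multi_space_new = re.compile(r"[ \t\f\v]{2,}")

# Hypothetical sample lines, shaped like what f.readlines() returns.
lines = ["a  b\n", "\n", "  \n", "end \n"]

old_result = [multi_space_old.sub(" ", line) for line in lines]
new_result = [multi_space_new.sub(" ", line) for line in lines]

print(old_result)   # the third and fourth lines lose their newline
print(new_result)   # every line keeps its newline
```

A truly empty line ("\n") survives under both patterns, since it holds
only one whitespace character.  That may be why only some of the blank
lines go missing.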

Another approach is to always rstrip() the newlines off, do the regex
processing, and then put them back in at the end.
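
A minimal sketch of that second approach might look like the following.
The helper name squeeze_spaces() is hypothetical, and I've assumed we
only want to protect "\r" and "\n", so I call rstrip("\r\n") rather than
a bare rstrip(), which would also throw away trailing spaces and tabs.

```python
import re

multi_space = re.compile(r"\s\s+")

def squeeze_spaces(line):
    """Collapse whitespace runs in one line while keeping its line ending."""
    # Separate the body of the line from its trailing "\r" / "\n" characters.
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    # The regex never sees the line terminator, so \s is safe to use here.
    return multi_space.sub(" ", body) + ending

print(repr(squeeze_spaces("a  b \n")))
print(repr(squeeze_spaces("  \n")))
```

With this in place, the original r"\s\s+" pattern can stay as it is,
because whatever ending the line came in with is glued back on
afterwards.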


There are some assumptions that the program makes about the HTML that you
might need to be careful of.  What does the program do if we pass it the
following string?

###
from StringIO import StringIO
sampleFile = """
<p
>hello world!<p
>
"""
###
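
If it helps, here is one way to try that out.  This is only a sketch: it
applies the tag pattern from your program line by line, the way
parseFile() does, but to a string instead of a real file.  The names
html_regex, sample, and cleaned are mine.

```python
import re

# The tag pattern from the program above, kept on a single line.
html_regex = re.compile(
    r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)", re.I)

sample = """
<p
>hello world!<p
>
"""

# Process line by line, as parseFile() does.
lines = sample.splitlines(True)   # True keeps the line endings
cleaned = [html_regex.sub(" ", line) for line in lines]

# True means no tag was removed from any line.
print(cleaned == lines)
```

Because each tag is split across two lines, no single line contains both
the opening '<p' and the closing '>', so the pattern has nothing to
match.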

Issues like these are already considered in the HTML parser modules in the
Standard Library, so if you can use HTMLParser, I'd strongly recommend it.
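
As a rough illustration, and not a drop-in replacement for parseFile(),
a parser subclass that keeps the text and drops the tags could be
sketched like this.  The class name TextCollector is hypothetical, and
the try/except import is there only because the module is called
HTMLParser in Python 2 and html.parser in Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name

class TextCollector(HTMLParser):
    """Hypothetical helper: keep the text content, drop the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for the text found between tags.
        self.chunks.append(data)

sample = """
<p
>hello world!<p
>
"""

collector = TextCollector()
collector.feed(sample)
collector.close()
text = "".join(collector.chunks)
print(repr(text))
```

The parser accepts whitespace inside a tag, so the '<p' and '>' that sit
on separate lines are still recognized as one tag, and only the text
around them is collected.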


Good luck to you!


