[Tutor] regex problem

Wed Jan 5 05:11:12 CET 2005

Michael Powe wrote:
> Hello,
> 
> I'm having erratic results with a regex.  I'm hoping someone can
> pinpoint the problem.
> 
> This function removes HTML formatting codes from a text email that is
> poorly exported -- it is supposed to be a text version of an HTML
> mailing, but it's basically just a text version of the HTML page.  I'm
> not after anything elaborate, but it has gotten to be a bit of an
> itch.  ;-)
> 
> def parseFile(inFile) :
>     import re
>     bSpace = re.compile("^ ")
>     multiSpace = re.compile(r"\s\s+")
>     nbsp = re.compile(r"&nbsp;")
>     HTMLRegEx = re.compile(r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)",re.I)
> 
>     f = open(inFile,"r")
>     lines = f.readlines()
>     newLines = []
>     for line in lines :
>         line = HTMLRegEx.sub(' ',line)
>         line = bSpace.sub('',line)
>         line = nbsp.sub(' ',line)
>         line = multiSpace.sub(' ',line)
>         newLines.append(line)
>     f.close()
>     return newLines
> 
> Now, the main issue I'm looking at is with the multiSpace regex.  When
> applied, this removes some blank lines but not others.  I don't want
> it to remove any blank lines, just contiguous multiple spaces in a
> line.
> 

Hi Michael,

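If you use '\s\s+', and a line has ' \n' (space then newline) at the 
end, the space and the newline both match and the pair is replaced by a 
single space, so the line break disappears. That is why a 'blank' line 
that actually contains a space vanishes. If the line ends in 
'some chars\n' or the line is just '\n', there is only one whitespace 
character, the pattern does not match, and the newline stays.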
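
A small sketch of that behaviour, in case it helps to see it run. The 
sample strings are made up, and the spaces_only pattern is only a 
possible alternative of mine, not something from your script:

```python
import re

multi_space = re.compile(r"\s\s+")

# Trailing space plus newline: both are whitespace, so the pair is replaced.
print(repr(multi_space.sub(" ", "some chars \n")))   # 'some chars '
# Newline preceded by a non-space character: only one \s, so no match.
print(repr(multi_space.sub(" ", "some chars\n")))    # 'some chars\n'
# A line that is only a newline is also left alone.
print(repr(multi_space.sub(" ", "\n")))              # '\n'

# Hypothetical alternative: match runs of spaces/tabs only, never newlines.
spaces_only = re.compile(r"[ \t]{2,}")
print(repr(spaces_only.sub(" ", "a   b \n")))        # 'a b \n'
```

If you want to keep a regex for this step, restricting the character 
class to spaces and tabs should leave every line ending alone.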

An alternate approach might be to first get rid of any leading or 
trailing whitespace (including \r|\n), then get rid of 'internal' 
repeated space, with string methods.

Parsing html using regexes is likely to break easily; HTMLParser is a 
better solution, but it may seem more complicated at first. It may 
be worthwhile for you to look into that module; someone here would be 
able to help if necessary.
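
To give an idea of what the HTMLParser route might look like, here is a 
rough sketch. It assumes a current Python 3, where the module lives at 
html.parser (in 2005 it was the top-level HTMLParser module). The names 
TextExtractor and html_to_text are my own inventions, and the choice to 
turn P and BR into newlines is just a guess at what you want:

```python
from html.parser import HTMLParser  # Python 3 location of the module


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <style> elements."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "style":
            self.in_style = True
        elif tag in ("p", "br"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if not self.in_style:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # &nbsp; decodes to U+00A0; swap it for an ordinary space.
    return "".join(parser.parts).replace("\xa0", " ")
```

Comments are dropped because the default handle_comment does nothing. 
One caveat: if your file contains tags escaped as &lt;P&gt;, the parser 
will treat them as text, not tags, so they would need unescaping first.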

Short of the HTMLParser approach, I would try to reduce the dependence 
on regexes, using string methods where you can.

I would try something like this (untested) in your for loop to start:

for line in lines:
    line = line.strip()
    line = line.replace('&nbsp;',' ')
    line = HTMLRegEx.sub(' ',line)
    line = ' '.join(line.split())
    newLines.append(line)
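
Here is the same loop wrapped up so it can be run on its own. The 
function name clean_lines and the sample input are made up for 
illustration; the pattern is copied from your post. Note that strip() 
also removes the newline, so blank lines come back as empty strings and 
you would add the line endings again when writing the result out:

```python
import re

# Pattern copied from the original post, joined onto one line.
HTMLRegEx = re.compile(
    r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)", re.I)


def clean_lines(lines):
    """Strip the listed tags and collapse whitespace, line by line."""
    new_lines = []
    for line in lines:
        line = line.strip()                # drop leading/trailing whitespace
        line = line.replace('&nbsp;', ' ')
        line = HTMLRegEx.sub(' ', line)
        line = ' '.join(line.split())      # collapse internal whitespace runs
        new_lines.append(line)
    return new_lines


sample = ["<P>Hello&nbsp;&nbsp; <b>world</b></P>\n", "\n", "  plain   text \n"]
cleaned = clean_lines(sample)
print(cleaned)                 # ['Hello world', '', 'plain text']
print("\n".join(cleaned))      # blank line is preserved in the output
```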

The only pattern left is HTMLRegEx. Using HTMLParser you could probably 
remove regexes completely. When I used perl for most things, I don't 
think I wrote a single script that didn't use regexes. Now, I use python 
for most things, and I don't think I have a single python module that 
imports re. Weird.

> BTB, this also illustrates a difference between python and perl -- in
> perl, i can change "line" and it automatically changes the entry in
> the array; this doesn't work in python.  A bit annoying, actually.
> ;-)
> 

I once had the same trouble with python - some features annoyed me 
because they weren't like perl. Now I have the reverse problem, only 
this time my annoyance is justified. :)
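
For what it's worth, assigning to "line" inside the loop only rebinds 
the name; it never touches the list. If you do want to update the list 
in place, something like this should work (the sample data is made up):

```python
lines = ["  a  ", " b"]

# Rebinding the loop variable leaves the list untouched.
for line in lines:
    line = line.strip()
unchanged = list(lines)        # still ['  a  ', ' b']

# Assigning through the index changes the list itself.
for i, line in enumerate(lines):
    lines[i] = line.strip()
print(lines)                   # ['a', 'b']

# Building a new list with a comprehension is the more common idiom.
stripped = [line.strip() for line in lines]
```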

> Thanks for any help.  If there's a better way to do this, I'm open to
> suggestions on that regard, too.
> 

Good luck.

Rich