[Tutor] regex problem
Rich Krauter
rmkrauter at yahoo.com
Wed Jan 5 05:11:12 CET 2005
Michael Powe wrote:
> Hello,
>
> I'm having erratic results with a regex. I'm hoping someone can
> pinpoint the problem.
>
> This function removes HTML formatting codes from a text email that is
> poorly exported -- it is supposed to be a text version of an HTML
> mailing, but it's basically just a text version of the HTML page. I'm
> not after anything elaborate, but it has gotten to be a bit of an
> itch. ;-)
>
> def parseFile(inFile) :
>     import re
>     bSpace = re.compile("^ ")
>     multiSpace = re.compile(r"\s\s+")
>     nbsp = re.compile(r"&nbsp;")
>     HTMLRegEx = re.compile(
>         r"(<|&lt;)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(>|&gt;)",
>         re.I)
>
>     f = open(inFile,"r")
>     lines = f.readlines()
>     newLines = []
>     for line in lines :
>         line = HTMLRegEx.sub(' ',line)
>         line = bSpace.sub('',line)
>         line = nbsp.sub(' ',line)
>         line = multiSpace.sub(' ',line)
>         newLines.append(line)
>     f.close()
>     return newLines
>
> Now, the main issue I'm looking at is with the multiSpace regex. When
> applied, this removes some blank lines but not others. I don't want
> it to remove any blank lines, just contiguous multiple spaces in a
> line.
>
Hi Michael,
If you use '\s\s+', and a line has ' \n' (space then newline) at the
end, the space and the newline will match and be substituted. If the
line ends in 'some chars\n' or the line is just '\n', the newline will
stay.
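
To see this concretely, here is a minimal sketch comparing your pattern
with one that only collapses spaces and tabs. The name spaces_only and
the sample strings are my own assumptions, not something from your
script:

```python
import re

multi_space = re.compile(r"\s\s+")      # the pattern from the question
spaces_only = re.compile(r"[ \t]{2,}")  # assumption: only spaces/tabs should collapse

# \s also matches "\n", so a trailing space followed by a newline
# counts as two whitespace characters and gets replaced.
samples = ["some  chars \n", "some chars\n", " \n", "\n"]
for s in samples:
    print(repr(s), "->",
          repr(multi_space.sub(" ", s)), "|",
          repr(spaces_only.sub(" ", s)))
```

With the second pattern the newline is never part of a match, so blank
lines and line endings are left alone.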
An alternate approach might be to first get rid of any leading or
trailing whitespace (including \r|\n), then get rid of 'internal'
repeated space, with string methods.
Parsing HTML with regexes is likely to break easily; HTMLParser is a
better solution, though it may seem more complicated at first. It may
be worthwhile for you to look into that module; someone here would be
able to help if necessary.
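
As a rough idea of what that might look like, here is a minimal sketch.
It assumes Python 3, where the module is html.parser; the class name
TextExtractor and the function html_to_text are hypothetical, and the
choice to turn P and BR tags into newlines is my own assumption about
what you want:

```python
from html.parser import HTMLParser  # Python 3 location of the module


class TextExtractor(HTMLParser):
    """Hypothetical helper: keep text, drop tags, comments and STYLE bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &nbsp;
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "style":
            self.in_style = True
        elif tag in ("p", "br"):
            self.parts.append("\n")  # assumption: these tags mark line breaks

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if not self.in_style:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # &nbsp; is decoded to a non-breaking space; make it a plain space.
    return "".join(parser.parts).replace("\xa0", " ")
```

Comments are skipped because the sketch never overrides handle_comment,
whose default implementation does nothing.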
Short of the HTMLParser approach, I would try to reduce the dependence
on regexes, using string methods where you can.
I would try something like this (untested) in your for loop to start:
    for line in lines:
        line = line.strip()
        line = line.replace('&nbsp;',' ')
        line = HTMLRegEx.sub(' ',line)
        line = ' '.join(line.split())
        newLines.append(line)
The only pattern left is HTMLRegEx. Using HTMLParser you could probably
remove regexes completely. When I used perl for most things, I don't
think I wrote a single script that didn't use regexes. Now, I use python
for most things, and I don't think I have a single python module that
imports re. Weird.
> BTB, this also illustrates a difference between python and perl -- in
> perl, i can change "line" and it automatically changes the entry in
> the array; this doesn't work in python. A bit annoying, actually.
> ;-)
>
I once had the same trouble with python - some features annoyed me
because they weren't like perl. Now I have the reverse problem, only
this time my annoyance is justified. :)
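
In case it helps, here is a small sketch of what happens in Python and
two common ways around it. The sample data and variable names are made
up for illustration:

```python
lines = ["  a  b \n", "c\n"]

# Rebinding the loop variable only changes what the name 'line' refers
# to; the list still holds the original strings.
for line in lines:
    line = line.strip()
unchanged = list(lines)

# Option 1: assign back into the list by index.
for i, line in enumerate(lines):
    lines[i] = " ".join(line.split())

# Option 2: build a new list with a list comprehension.
cleaned = [" ".join(line.split()) for line in ["  x   y\n"]]
```

Strings are immutable in Python, so there is no way to change one in
place; you always end up storing a new string somewhere.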
> Thanks for any help. If there's a better way to do this, I'm open to
> suggestions on that regard, too.
>
Good luck.
Rich