[Tutor] regex problem

Kent Johnson kent37 at tds.net
Wed Jan 5 12:33:32 CET 2005


If you search comp.lang.python for 'convert html text', the top four results all have solutions for
this problem, including a reference to this cookbook recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297

comp.lang.python can be found here:
http://groups-beta.google.com/group/comp.lang.python?hl=en&lr=&ie=UTF-8&c2coff=1
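
For comparison with the regex approach, here is a minimal sketch of a parser-based one using the standard library's html.parser (Python 3). It is not the code from the recipe above; the names TextExtractor and html_to_text are made up for illustration. It assumes the input contains real tags like <P>, not escaped ones like &lt;P&gt;.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags, comments, and <style> bodies."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
        elif tag in ("p", "br"):
            # Treat paragraph and line-break tags as line breaks in the text.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Skip the stylesheet text found between <style> and </style>.
        if not self._in_style:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Comments are dropped because the default handle_comment does nothing. An &nbsp; comes back as the character '\xa0', so a final .replace('\xa0', ' ') may still be wanted.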

Kent


Michael Powe wrote:
> Hello,
> 
> I'm having erratic results with a regex.  I'm hoping someone can
> pinpoint the problem.
> 
> This function removes HTML formatting codes from a text email that is
> poorly exported -- it is supposed to be a text version of an HTML
> mailing, but it's basically just a text version of the HTML page.  I'm
> not after anything elaborate, but it has gotten to be a bit of an
> itch.  ;-)
> 
> def parseFile(inFile) :
>     import re
>     bSpace = re.compile("^ ")
>     multiSpace = re.compile(r"\s\s+")
>     nbsp = re.compile(r"&nbsp;")
>     HTMLRegEx = re.compile(
>         r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)",
>         re.I)
> 
>     f = open(inFile,"r")
>     lines = f.readlines()
>     newLines = []
>     for line in lines :
>         line = HTMLRegEx.sub(' ',line)
>         line = bSpace.sub('',line)
>         line = nbsp.sub(' ',line)
>         line = multiSpace.sub(' ',line)
>         newLines.append(line)
>     f.close()
>     return newLines
> 
> Now, the main issue I'm looking at is with the multiSpace regex.  When
> applied, this removes some blank lines but not others.  I don't want
> it to remove any blank lines, just contiguous multiple spaces in a
> line.
> 
> BTW, this also illustrates a difference between Python and Perl -- in
> Perl, I can change "line" and it automatically changes the entry in
> the array; this doesn't work in Python.  A bit annoying, actually.
> ;-)
> 
> Thanks for any help.  If there's a better way to do this, I'm open to
> suggestions in that regard, too.
> 
> mp
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
> 
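
On the multiSpace question specifically, a likely cause is that \s also matches the newline at the end of each line returned by readlines(). A line that is down to a single space plus the newline (for example after a tag was replaced by ' ') matches \s\s+ and is replaced by one space, so the line break disappears. Truly empty lines ("\n") survive, which would explain the erratic results. Below is a minimal sketch of one fix, assuming only spaces and tabs should be collapsed; the names MULTI_SPACE and collapse_spaces are illustrative and not from the original post.

```python
import re

# Two or more spaces or tabs. Unlike \s, this class never matches "\n".
MULTI_SPACE = re.compile(r"[ \t]{2,}")


def collapse_spaces(lines):
    """Return a new list with each run of spaces/tabs reduced to one space."""
    # Rebinding a loop variable does not modify the list it came from,
    # so build a new list instead of trying to change entries in place.
    return [MULTI_SPACE.sub(" ", line) for line in lines]


if __name__ == "__main__":
    sample = ["a   b\n", " \n", "\n"]
    print(collapse_spaces(sample))
```

Building a new list, as the original function already does with newLines, is the usual Python idiom here. To change the entries in place, assign by index instead, e.g. lines[i] = ... inside a loop over enumerate(lines).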

