string encoding regex problem
Peter Otten
__peter__ at web.de
Sat Aug 23 17:13:07 EDT 2014
Philipp Kraus wrote:
> I have created a short script:
>
> ---------
> #!/usr/bin/env python
>
> import re, urllib2
>
>
> def URLReader(url) :
>     f = urllib2.urlopen(url)
>     data = f.read()
>     f.close()
>     return data
>
>
> print re.match( "\<small\ \>.*\<\/small\>",
> URLReader("http://sourceforge.net/projects/boost/") )
> ---------
>
> Within the data the string "<small>boost_1_56_0.tar.gz</small>" should
> be matched, but I always get a None result from re.match, and re.search
> also returns None.
>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

As the string doesn't start with text that matches your regex, re.match() is
clearly the wrong function here, but re.search() works for me:
>>> import re, urllib2
>>>
>>>
>>> def URLReader(url) :
...     f = urllib2.urlopen(url)
...     data = f.read()
...     f.close()
...     return data
...
>>> data = URLReader("http://sourceforge.net/projects/boost/")
>>> re.search("\<small\ \>.*\<\/small\>", data)
<_sre.SRE_Match object at 0x7f282dd58718>
>>> _.group()
'<small >boost_1_56_pdf.7z</small>'
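
The difference can be reproduced without touching the network. The following
is only a minimal sketch: `sample` is a made-up one-line stand-in for the
downloaded page, not the real SourceForge HTML, and the pattern is written as
a raw string without the unnecessary backslashes.

```python
import re

# Hypothetical stand-in for the downloaded page (assumption, not the real HTML).
sample = '<html><body><small >boost_1_56_pdf.7z</small></body></html>'

# Same idea as the pattern in the thread, minus the unneeded escapes.
pattern = r"<small >.*</small>"

# re.match() only tries the pattern at position 0 of the string.
anchored = re.match(pattern, sample)

# re.search() scans forward and returns the first match anywhere in the string.
found = re.search(pattern, sample)

print(anchored)        # the string starts with "<html>", so nothing matches at position 0
print(found.group())   # the matched "<small >...</small>" substring
```

re.match() would only succeed here if the string itself began with the
`<small >` tag.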
> I have tested the regex at http://regex101.com/ against the HTML code,
> and there the regex matches.
>
> Can you please help me fix the problem? I don't understand why the
> match returns None.
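
One more thing to watch: the `\ ` in the pattern demands a literal space, which
is why the match shown above reads '<small >...' and why a tag written as
'<small>' would not match at all. A small sketch of that effect follows; the
two test strings are made up from the snippets quoted in this thread, and
`relaxed` is just one hypothetical way to tolerate both spellings.

```python
import re

# Pattern from the thread, written as a raw string.
original = r"\<small\ \>.*\<\/small\>"

# Hypothetical variant (assumption): optional whitespace, non-greedy body.
relaxed = r"<small\s*>.*?</small>"

with_space = "<small >boost_1_56_pdf.7z</small>"
without_space = "<small>boost_1_56_0.tar.gz</small>"

# The escaped space must be present in the text for the original pattern.
print(re.search(original, with_space) is not None)
print(re.search(original, without_space) is not None)

# The relaxed pattern accepts both spellings of the opening tag.
print(re.search(relaxed, with_space).group())
print(re.search(relaxed, without_space).group())
```

For anything beyond a quick one-off, a real HTML parser is usually a safer
choice than a regex.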