<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<meta name="Generator" content="Cocoa HTML Writer">
<meta name="CocoaVersion" content="1187.4">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 15.0px; font: 12.0px Helvetica}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 15.0px; font: 12.0px Helvetica; min-height: 14.0px}
p.p3 {margin: 0.0px 0.0px 0.0px 12.0px; line-height: 14.0px; font: 12.0px Helvetica; color: #011892}
p.p4 {margin: 0.0px 0.0px 0.0px 12.0px; line-height: 14.0px; font: 12.0px Helvetica; color: #011892; min-height: 14.0px}
p.p5 {margin: 0.0px 0.0px 0.0px 24.0px; font: 12.0px Helvetica; color: #008e00}
p.p6 {margin: 0.0px 0.0px 0.0px 12.0px; font: 12.0px Helvetica; color: #011892; min-height: 14.0px}
p.p7 {margin: 0.0px 0.0px 0.0px 12.0px; font: 12.0px Helvetica; color: #011892}
p.p8 {margin: 0.0px 0.0px 0.0px 48.0px; font: 12.0px Helvetica; color: #011892}
p.p9 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; color: #000000; min-height: 14.0px}
p.p10 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; color: #000000}
p.p11 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Helvetica; min-height: 14.0px}
p.p12 {margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Helvetica}
</style>
</head>
<body>
<p class="p1">Hi,</p>
<p class="p2"><br></p>
<p class="p1">On 2014-08-16 09:01:57 +0000, Peter Otten said:</p>
<p class="p2"><br></p>
<p class="p3">Philipp Kraus wrote:</p>
<p class="p4"><br></p>
<p class="p5">The code worked correctly until last week, and I didn't change the pattern.<span class="Apple-converted-space"> </span></p>
<p class="p6"><br></p>
<p class="p7">Websites' contents and structure change sometimes.</p>
<p class="p6"><br></p>
<p class="p5">My question is, can it be a problem with string encoding?<span class="Apple-converted-space"> </span></p>
<p class="p6"><br></p>
<p class="p7">Your regex is all-ASCII, so an encoding problem is very unlikely.</p>
<p class="p6"><br></p>
<p class="p5">found = re.search( "<a<span class="Apple-converted-space"> </span></p>
<p class="p5">href=\"/projects/boost/files/latest/download\?source=files\"<span class="Apple-converted-space"> </span></p>
<p class="p5">title=\"/boost/(.*)",</p>
<p class="p5">data)</p>
<p class="p6"><br></p>
<p class="p5">Did I escape the question mark and quotes</p>
<p class="p5">correctly?</p>
<p class="p6"><br></p>
<p class="p7">Yes.</p>
<p class="p6"><br></p>
<p class="p7">A quick check...</p>
<p class="p6"><br></p>
<p class="p8">data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()</p>
<p class="p8">re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)</p>
<p class="p7">['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z:<span class="Apple-converted-space"> </span>released on 2014-08-14 16:35:00 UTC">']</p>
<p class="p6"><br></p>
<p class="p7">...reveals that the matching link has "/boost-docs/" in its title, so the</p>
<p class="p7"><span class="Apple-converted-space"> </span>site contents probably did change.<span class="Apple-converted-space"> </span></p>
<p class="p9"><br></p>
<p class="p10">I have created a short script:</p>
<p class="p11"><br></p>
<p class="p12">---------</p>
<p class="p12">#!/usr/bin/env python</p>
<p class="p11"><br></p>
<p class="p12">import re, urllib2</p>
<p class="p11"><br></p>
<p class="p11"><br></p>
<p class="p12">def URLReader(url) :</p>
<p class="p12"><span class="Apple-converted-space"> </span>f = urllib2.urlopen(url)</p>
<p class="p12"><span class="Apple-converted-space"> </span>data = f.read()</p>
<p class="p12"><span class="Apple-converted-space"> </span>f.close()</p>
<p class="p12"><span class="Apple-converted-space"> </span>return data</p>
<p class="p11"><br></p>
<p class="p11"><br></p>
<p class="p12">print re.match( "\<small\ \>.*\<\/small\>", URLReader("http://sourceforge.net/projects/boost/") )</p>
<p class="p12">---------</p>
<p class="p11"><br></p>
<p class="p12">Within the data, the string "<small>boost_1_56_0.tar.gz</small>" should be matched, but re.match always returns None, and re.search returns None as well.</p>
<p class="p12">I have tested the regex on <a href="http://regex101.com/">http://regex101.com/</a> against the HTML code, and there the regex matches.</p>
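<p class="p11"><br></p>
<p class="p12">To narrow this down without the network, here is a minimal sketch that adds the pattern back fragment by fragment and reports the longest prefix that re.search still finds. It assumes the page contains the string exactly as quoted above; sample, fragments and longest_matching_prefix are hypothetical names for illustration and not part of my script.</p>
<pre>
```python
import re

# Hypothetical stand-in for the downloaded page (the real HTML is not reproduced here).
sample = 'header <small>boost_1_56_0.tar.gz</small> footer'

# The pattern from the script, split into fragments (the split points are my own choice).
fragments = [r"\<small", r"\ ", r"\>", r".*", r"\<\/small\>"]


def longest_matching_prefix(parts, text):
    """Join the fragments one at a time and return the longest
    prefix pattern that re.search still finds in text."""
    matched = ""
    for i in range(1, len(parts) + 1):
        candidate = "".join(parts[:i])
        if re.search(candidate, text) is None:
            break
        matched = candidate
    return matched


full_pattern = "".join(fragments)

# re.match only tries position 0 of the string; re.search scans the whole string.
print(re.match(full_pattern, sample))
print(re.search(full_pattern, sample))
print(longest_matching_prefix(fragments, sample))
```
</pre>
<p class="p12">With this sample, the search stops matching as soon as the "\ " fragment is added, so the escaped space looks like the part to check against the real page. The real page may of course differ from my sample string.</p>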
<p class="p11"><br></p>
<p class="p12">Can you please help me fix the problem? I don't understand why the match returns None.</p>
<p class="p11"><br></p>
<p class="p12">Thanks</p>
<p class="p11"><br></p>
<p class="p12">Phil</p>
</body>
</html>