[ActivePython 126.96.36.199] Why does Python not return first line?
benjamin.kaplan at case.edu
Mon Mar 16 02:13:36 CET 2009
On Sun, Mar 15, 2009 at 8:14 PM, Gilles Ganault <nospam at nospam.com> wrote:
> I'm stuck at why Python doesn't return the first line in this simple
> script:
> import re
>
> response = ("<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, "
>             "St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>")
> re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>',
>                         re.I | re.S | re.M)
> address = re_address.search(response)
> if address:
>     address = address.group(1).strip()
>     print "address is %s" % address
> else:
>     print "address not found"
> The output is:
> London, NW8 9AY<br />
> Could this be due to the non-printable characters like TAB or ENTER?
> FWIW, I think that the original web page I'm trying to parse is from a
> *nix host.
Actually, the problem is that the only newlines you have in there are bare
carriage returns (\r), the line ending used by Mac OS Classic and Commodore
machines. Windows newlines date back to typewriters and teletypes. There are
two characters in a Windows newline: a carriage return (\r), which returns
the cursor to the beginning of the line, and a linefeed (\n), which moves to
the next line. I think what's happening is that the Windows console mimics
the typewriter here: at the carriage return it goes back to the beginning of
the line but doesn't move to a new one. The second half of the text then
overwrites the first half, and you get the output you're seeing. The match
itself still contains both lines; only the display is misleading. The only
way I can think of to fix this is to search for any carriage return not
followed by a linefeed and add a linefeed in.
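
Here is a minimal sketch of that fix. It assumes Python 3 (the quoted
script uses Python 2 print statements), and the helper name
add_missing_linefeeds is made up for illustration. It uses a negative
lookahead to find every \r that is not followed by \n, and repr() to make
the hidden control characters visible before fixing them.

```python
import re

# The string from the original post; every "\r" here is a bare carriage return.
response = ("<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, "
            "St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>")

re_address = re.compile(r'<span>Address :</span></td>.+?<td>(.+?)</td>',
                        re.I | re.S)


def add_missing_linefeeds(text):
    # Hypothetical helper: insert a linefeed after every carriage return
    # that is not already followed by one. (?!\n) is a negative lookahead.
    return re.sub(r'\r(?!\n)', '\r\n', text)


match = re_address.search(response)
address = match.group(1).strip() if match else None

# repr() shows control characters as escapes instead of letting the
# console act on them, so both lines of the address are visible.
print(repr(address))

if address is not None:
    fixed = add_missing_linefeeds(address)
    print("address is %s" % fixed)
else:
    print("address not found")
```

If you only need the individual lines, address.splitlines() may be simpler,
since it treats \r, \n and \r\n all as line boundaries.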