[ActivePython] Why does Python not return first line?

Benjamin Kaplan benjamin.kaplan at case.edu
Mon Mar 16 02:13:36 CET 2009

On Sun, Mar 15, 2009 at 8:14 PM, Gilles Ganault <nospam at nospam.com> wrote:

> Hello
> I'm stuck at why Python doesn't return the first line in this simple
> regex:
> ===========
> response = "<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>"
> re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>',re.I | re.S | re.M)
> address = re_address.search(response)
> if address:
>        address = address.group(1).strip()
>        print "address is %s" % address
> else:
>        print "address not found"
> ===========
> C:\test.py
>                        London, NW8 9AY<br />
> ===========
> Could this be due to the non-printable characters like TAB or ENTER?
> FWIW, I think that the original web page I'm trying to parse is from a
> *nix host.

Actually, the problem is that the only newlines in your string are bare
carriage returns (\r), which is the Mac OS Classic/Commodore line ending.
A Windows newline is two characters, a convention that dates back to
typewriters: a carriage return (\r), which returns the cursor to the
beginning of the line, and a linefeed (\n), which moves to the next line.
I think what's happening is that the Windows console carries out those
commands literally: at each carriage return it goes back to the beginning
of the line but doesn't move down to a new one. The second half of the
text then overwrites the first half, and you get the output you're seeing.
The only way I can think of to fix this is to search for any carriage
return not followed by a linefeed and add a linefeed after it.
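
Something along these lines might do it. This is only a rough sketch of
the fix described above, written for Python 3 (so print is a function),
and the helper name add_missing_linefeeds is made up for illustration.
Printing the repr() of the match should also let you see the hidden \r
characters instead of having the console act on them.

```python
import re

# Sample data modeled on the response string from the original post.
response = ("<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, "
            "St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>")


def add_missing_linefeeds(text):
    """Add "\\n" after every "\\r" that is not already followed by "\\n".

    Hypothetical helper, not part of the original script.
    """
    # (?!\n) is a negative lookahead: match "\r" only when no "\n" follows.
    return re.sub(r"\r(?!\n)", "\r\n", text)


re_address = re.compile(r"<span>Address :</span></td>.+?<td>(.+?)</td>",
                        re.I | re.S)

match = re_address.search(response)
if match:
    address = add_missing_linefeeds(match.group(1).strip())
    # repr() shows control characters as escapes, which makes it easy to
    # check what the regex really captured.
    print(repr(address))
    print("address is %s" % address)
else:
    address = None
    print("address not found")
```

If you would rather end up with plain \n everywhere, replacing "\r\n"
and then "\r" with "\n" is another option. Which one is right depends
on what you do with the text afterwards.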
