[Tutor] scraping and saving in file SOLVED
Peter Otten
__peter__ at web.de
Wed Dec 29 13:45:57 CET 2010
Tommy Kaas wrote:
> With Steven's help about writing and Peter's help about import codecs - and
> when I used \r\n instead of \r to give me new lines everything worked. I
> just thought that \n would be necessary? Thanks.
> Tommy
Newline handling varies across operating systems. If you are on Windows and
open a file in text mode your program sees plain "\n", but the data stored
on disk is "\r\n". Most other OSes don't mess with newlines.
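
The translation described above can be made visible on any OS with a small sketch. It assumes Python 3, where the built-in open is io.open. The helper names write_lines and raw_bytes and the file name are made up for illustration. Passing newline="\r\n" asks the file object to do what Windows text mode does by default, and reading the file back in binary mode shows what actually landed on disk:

```python
import io
import os
import tempfile


def write_lines(path, lines, newline):
    # newline="\r\n" turns every "\n" written into "\r\n" on disk;
    # newline="" switches the translation off.
    with io.open(path, "w", encoding="utf-8", newline=newline) as f:
        for line in lines:
            f.write(line + "\n")


def raw_bytes(path):
    # Binary mode shows exactly what was stored, without any translation.
    with io.open(path, "rb") as f:
        return f.read()


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "newline_demo.txt")

# The program writes plain "\n" both times; only the file object differs.
write_lines(path, ["one", "two"], newline="\r\n")
translated = raw_bytes(path)

write_lines(path, ["one", "two"], newline="")
untouched = raw_bytes(path)

print(repr(translated))
print(repr(untouched))
```

In both calls the program only ever writes "\n"; the difference in the stored bytes comes entirely from the file object.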
If you always want "\r\n" you can rely on the csv module to write your data,
because its writer ends every row with "\r\n" by default. The drawback is
that you have to encode the strings manually:
import csv
import urllib2

from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with open('tabeltest.txt', "wb") as f:
    writer = csv.writer(f, delimiter="#")
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        writer.writerow([unicode(col.string).encode("utf-8")
                         for col in cols])
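
The row terminator the csv module uses can be checked without touching the network. The following is only a sketch and assumes Python 3, where csv.writer accepts text strings and the encoding step happens afterwards (unlike the Python 2 script above, which encodes each cell before writing). The function name rows_to_text and the sample rows are invented for illustration:

```python
import csv
import io


def rows_to_text(rows, delimiter="#"):
    # newline="" keeps the buffer from translating line endings,
    # so whatever csv.writer emits is preserved unchanged.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter)
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# Made-up sample data standing in for the scraped table cells.
sample = [["\u00e4lpha", "1"], ["beta", "2"]]

text = rows_to_text(sample)
data = text.encode("utf-8")  # encode once, after the rows are assembled

print(repr(text))
print(repr(data))
```

Each row ends in "\r\n" regardless of the platform, because the terminator comes from the csv writer and not from the file's text mode.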
PS: It took me some time to figure out how to deal with BeautifulSoup's
flavour of unicode:
>>> import BeautifulSoup as bs
>>> s = bs.NavigableString(u"älpha")
>>> s
u'\xe4lpha'
>>> s.encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 430, in encode
    return self.decode().encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> unicode(s).encode("utf-8") # heureka
'\xc3\xa4lpha'