[Tutor] scraping and saving in file
Peter Otten
__peter__ at web.de
Wed Dec 29 12:26:24 CET 2010
Tommy Kaas wrote:
> Steven D'Aprano wrote:
>> But in your case, the best way is not to use print at all. You are
>> writing to a file -- write to the file directly, don't mess about
>> with print. Untested:
>>
>>
>> f = open('tabeltest.txt', 'w')
>> url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
>> soup = BeautifulSoup(urllib2.urlopen(url).read())
>> rows = soup.findAll('tr')
>> for tr in rows:
>>     cols = tr.findAll('td')
>>     output = "#".join(cols[i].string for i in (0, 1, 2, 3))
>>     f.write(output + '\n')  # don't forget the newline after each row
>> f.close()
>
> Steven, thanks for the advice.
> I see the point. But now I have problems with the Danish characters. I
> get this:
>
> Traceback (most recent call last):
>   File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
>     f.write(output + '\n')  # don't forget the newline after each row
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
> position 5: ordinal not in range(128)
>
 
> I have tried to add # -*- coding: utf-8 -*- to the top of the script,
> but it doesn't help?
The coding cookie only affects unicode string literals in the source code;
it doesn't change how the unicode data coming from BeautifulSoup is handled.
As I suspected in my other post, you have to convert your data to a specific
encoding (I use UTF-8 below) before you can write it to a file:
import urllib2
import codecs
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with codecs.open('tabeltest.txt', "w", encoding="utf-8") as f:
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        print >> f, "#".join(col.string for col in cols)
The with statement implicitly closes the file, so you can avoid f.close() at
the end of the script.
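To see the encoding step on its own, here is a minimal sketch with the
scraping left out. The sample row and file path are invented for
illustration; the u''-literal and codecs.open syntax also run unchanged on
modern Python, which is what this was checked against.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: why 'ascii' fails on Danish text and UTF-8 doesn't.
import codecs
import os
import tempfile

# A made-up table row containing the Danish 'ø' (u'\xf8') from the traceback.
row = u"K\xf8benhavn#Region Hovedstaden"

# Encoding to ASCII is exactly what raised the UnicodeEncodeError above:
try:
    row.encode("ascii")
except UnicodeEncodeError:
    pass  # 'ascii' codec can't encode character u'\xf8'

# UTF-8 can represent it ('ø' becomes the two bytes 0xc3 0xb8):
assert row.encode("utf-8") == b"K\xc3\xb8benhavn#Region Hovedstaden"

# codecs.open does this encode transparently on every write:
path = os.path.join(tempfile.mkdtemp(), "tabeltest.txt")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(row + "\n")

# Reading back with the same codec restores the unicode string:
with codecs.open(path, encoding="utf-8") as f:
    assert f.read() == row + "\n"
```

The point is that the encoding choice lives in one place (the file object)
instead of being repeated at every write call.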
Peter