[Tutor] scraping and saving in file

Peter Otten __peter__ at web.de
Wed Dec 29 12:26:24 CET 2010


Tommy Kaas wrote:

> Steven D'Aprano wrote:
>> But in your case, the best way is not to use print at all. You are
>> writing to a file -- write to the file directly, don't mess about with
>> print.
>> Untested:
>> 
>> 
>> f = open('tabeltest.txt', 'w')
>> url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
>> soup = BeautifulSoup(urllib2.urlopen(url).read())
>> rows = soup.findAll('tr')
>> for tr in rows:
>>      cols = tr.findAll('td')
>>      output = "#".join(cols[i].string for i in (0, 1, 2, 3))
>>      f.write(output + '\n')  # don't forget the newline after each row
>> f.close()
> 
> Steven, thanks for the advice.
> I see the point. But now I have problems with the Danish characters. I get
> this:
> 
> Traceback (most recent call last):
>   File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
>     f.write(output + '\n')  # don't forget the newline after each row
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
> position 5: ordinal not in range(128)
> 
> I have tried to add # -*- coding: utf-8 -*- to the top of the script, but
> it doesn't help?

The coding cookie only affects unicode string constants in the source code; 
it doesn't change how the unicode data coming from BeautifulSoup is handled.
As I suspected in my other post, you have to convert your data to a specific 
encoding (I use UTF-8 below) before you can write it to a file:

import urllib2 
import codecs
from BeautifulSoup import BeautifulSoup 

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with codecs.open('tabeltest.txt', "w", encoding="utf-8") as f:
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        print >> f, "#".join(col.string for col in cols)

The with statement implicitly closes the file, so you can avoid f.close() at 
the end of the script.
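For anyone reading this on Python 3, the same advice still applies: pick an
explicit encoding when you open the file for writing, and let the with
statement close it. A minimal sketch (the filename and the sample rows below
are placeholders, not data from the thread):

```python
# Python 3 sketch: open the output file with an explicit encoding,
# and let the with statement close it automatically.
sample_rows = [["København", "Region Hovedstaden"],
               ["Århus", "Region Midtjylland"]]

with open("tabeltest.txt", "w", encoding="utf-8") as f:
    for cols in sample_rows:
        f.write("#".join(cols) + "\n")  # '#'-separated, one row per line

# The original UnicodeEncodeError comes from encoding to ASCII:
# "København".encode("ascii") raises UnicodeEncodeError,
# while "København".encode("utf-8") succeeds.
```

In Python 3, open() takes the encoding argument directly, so the codecs
module is no longer needed for this.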

Peter
