[Tutor] scraping and saving in file

Peter Otten __peter__ at web.de
Wed Dec 29 12:26:24 CET 2010


Tommy Kaas wrote:

> Steven D'Aprano wrote:
>> But in your case, the best way is not to use print at all. You are
>> writing to a file -- write to the file directly, don't mess about with
>> print.
>> Untested:
>> 
>> 
>> f = open('tabeltest.txt', 'w')
>> url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
>> soup = BeautifulSoup(urllib2.urlopen(url).read())
>> rows = soup.findAll('tr')
>> for tr in rows:
>>      cols = tr.findAll('td')
>>      output = "#".join(cols[i].string for i in (0, 1, 2, 3))
>>      f.write(output + '\n')  # don't forget the newline after each row
>> f.close()
> 
> Steven, thanks for the advice.
> I see the point. But now I have problems with the Danish characters. I get
> this:
> 
> Traceback (most recent call last):
>   File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
>     f.write(output + '\n')  # don't forget the newline after each row
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
> position 5: ordinal not in range(128)
> 
> I have tried to add # -*- coding: utf-8 -*- to the top of the script, but
> it doesn't help?

The coding cookie only affects unicode string constants in the source code; 
it doesn't change how the unicode data coming from BeautifulSoup is handled.
As I suspected in my other post, you have to convert your data to a specific 
encoding (I use UTF-8 below) before you can write it to a file:

import urllib2 
import codecs
from BeautifulSoup import BeautifulSoup 

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with codecs.open('tabeltest.txt', "w", encoding="utf-8") as f:
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        print >> f, "#".join(col.string for col in cols)

The with statement implicitly closes the file, so you can avoid f.close() at 
the end of the script.
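For anyone reading this on Python 3, the same advice still applies: pick an
explicit encoding when you open the file for writing, and let the with
statement close it. A minimal sketch (the filename and the sample rows below
are placeholders, not data from the thread):

```python
# Python 3 sketch: open the output file with an explicit encoding,
# and let the with statement close it automatically.
sample_rows = [["København", "Region Hovedstaden"],
               ["Århus", "Region Midtjylland"]]

with open("tabeltest.txt", "w", encoding="utf-8") as f:
    for cols in sample_rows:
        f.write("#".join(cols) + "\n")  # '#'-separated, one row per line

# The original UnicodeEncodeError comes from encoding to ASCII:
# "København".encode("ascii") raises UnicodeEncodeError,
# while "København".encode("utf-8") succeeds.
```

In Python 3, open() takes the encoding argument directly, so the codecs
module is no longer needed for this.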

Peter
