[Tutor] scraping and saving in file

Knacktus knacktus at googlemail.com
Wed Dec 29 11:41:24 CET 2010


On 29.12.2010 10:54, Tommy Kaas wrote:
> Hi,
>
> I'm trying to learn basic web scraping, starting from scratch. I'm
> using ActivePython 2.6.6.
>
> I have uploaded a simple table to my web page and am trying to scrape it
> and save the result in a text file, with the columns separated by #.
>
> It works fine, but besides the # I also get spaces between the columns in
> the text file. How do I avoid that?
>
> This is the script:
>
> import urllib2
>
> from BeautifulSoup import BeautifulSoup
>
> f = open('tabeltest.txt', 'w')
>
> soup = BeautifulSoup(urllib2.urlopen('http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())
>
> rows = soup.findAll('tr')
>
> for tr in rows:
>
>      cols = tr.findAll('td')
>
>      print >> f, cols[0].string,'#',cols[1].string,'#',cols[2].string,'#',cols[3].string

You can strip the whitespace from the strings. I assume the 
"string" attribute returns a string (I don't know the Beautiful Soup 
API), e.g.:
cols[0].string.strip()

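Also, you can use join() to create the complete string. The print 
statement puts a space between comma-separated items, so building one 
string first and printing that is what removes the spaces around the #: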

resulting_string = "#".join([col.string.strip() for col in cols])
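
To see the effect without fetching the page, here is a minimal sketch. The plain strings stand in for the values of col.string, and the padding around them is made up to imitate whitespace in the HTML source; it is not taken from the real table.

```python
# Plain strings stand in for col.string; the surrounding whitespace is
# invented for illustration only.
cells = [" 151 ", "Ballerup\n", " Hovedstaden", "1084 "]

# strip() removes leading and trailing whitespace from each cell, and
# join() places exactly one "#" between neighbours, with no extra spaces.
resulting_string = "#".join([cell.strip() for cell in cells])

print(resulting_string)  # 151#Ballerup#Hovedstaden#1084
```

In the script, print >> f, resulting_string would then write that line to the file.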

The long version without a list comprehension (just for illustration; 
the list comprehension is preferable). The table has four columns, so 
the indices run from 0 to 3:

resulting_string = "#".join([cols[0].string.strip(), 
cols[1].string.strip(), cols[2].string.strip(), cols[3].string.strip()])

HTH,

Jan




>
> f.close()
>
> And the text file looks like this:
>
> Kommunenr # Kommune # Region # Regionsnr
>
> 101 # København # Hovedstaden # 1084
>
> 147 # Frederiksberg # Hovedstaden # 1084
>
> 151 # Ballerup # Hovedstaden # 1084
>
> 153 # Brøndby # Hovedstaden # 1084
>
> 155 # Dragør # Hovedstaden # 1084
>
> Thanks in advance
>
> Tommy Kaas
>
> Kaas & Mulvad
>
> Lykkesholms Alle 2A, 3.
>
> 1902 Frederiksberg C
>
> Mobil: 27268818
>
> Mail: tommy.kaas at kaasogmulvad.dk <mailto:tommy.kaas at kaasogmulvad.dk>
>
> Web: www.kaasogmulvad.dk <http://www.kaasogmulvad.dk>