using urllib2

Jeff McNeil jeff at jmcneil.net
Mon Jun 30 00:30:02 CEST 2008


I didn't spend a lot of time debugging that code -- I've been using
Beautiful Soup a lot at work lately and really pulled that out of
memory at about 2:00 AM a couple of days ago.

In the 5 minutes I spent on it, it appeared that the definitions were
set up like so:

<table class="luna-Ent">
<tr>
<td>Blah</td>
<td><span>perhaps a span tag</span>Definition</td>
</tr>
</table>

I was attempting to find all of the tables with that class assigned
(thus the dictionary passed to findAll) and then, using the -1, grab
the contents of the last 'td' defined within the definition table.
The second -1 was an index into all of the contents of the last TD,
attempting to pull the definition string -- I saw a few span tags in
there that don't really matter.
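
To make that indexing concrete, here is a rough sketch in Python 3 using
only the standard library's html.parser rather than BeautifulSoup. The
class name LunaEntParser, the helper last_cell_pieces, and the SAMPLE
markup (including the span in the first definition) are made up for
illustration; the real page may be laid out differently. It only
approximates what .contents[-1] does, but it shows how the last piece of
a cell can be a lone space when a tag is followed by trailing whitespace.

```python
from html.parser import HTMLParser


class LunaEntParser(HTMLParser):
    """Collect the text pieces of every <td> in <table class="luna-Ent">."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of cells per matching table
        self._in_table = False
        self._cell = None      # text pieces of the <td> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("class") == "luna-Ent":
            self._in_table = True
            self.tables.append([])
        elif tag == "td" and self._in_table:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.tables[-1].append(self._cell)
            self._cell = None
        elif tag == "table":
            self._in_table = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. a span) lands here as its own piece.
        if self._cell is not None:
            self._cell.append(data)


def last_cell_pieces(html):
    """Return the text pieces of the last <td> of each luna-Ent table."""
    parser = LunaEntParser()
    parser.feed(html)
    parser.close()
    return [cells[-1] for cells in parser.tables if cells]


# Hypothetical markup: the span in the first definition is an assumption.
SAMPLE = (
    '<table class="luna-Ent"><tr><td valign="top" class="dn">1.</td>'
    '<td valign="top">easy to understand: <span>a simple matter</span> </td>'
    '</tr></table>'
    '<table class="luna-Ent"><tr><td valign="top" class="dn">2.</td>'
    '<td valign="top">not complex or compound; single. </td></tr></table>'
)

for pieces in last_cell_pieces(SAMPLE):
    last_only = pieces[-1]             # roughly what the second -1 grabs
    whole = "".join(pieces).strip()    # all of the text in the cell
    print(repr(last_only), "|", whole)
```

Joining every piece of the cell, instead of taking only the last one,
keeps the definition text even when the cell ends in a tag plus a space.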

Grabbing definitions like this is troublesome.  If they change their
interface (i.e. HTML display), then you'll have to go back and change
what you're doing.  I was really just intending to give you a few
pointers -- not provide production-quality code.

If you need definitions, another place you may want to look would be
Google's JSON search API.  You may be able to search for
'define:word', and then you don't have to rely on screen
scraping.

Thanks!

Jeff

On Jun 29, 4:04 pm, Alexnb <alexnbr... at gmail.com> wrote:
> Okay, now I ran in it the shell, and this is what happened:
>
> >>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>
> ...     tabs.findAll('td')[-1].contents[-1].string
> ...
> u' '
> u' '
> u' '
> u' '
> u' '
> u'not complex or compound; single. '
> u' '
> u' '
> u' '
> u' '
> u' '
> u'inconsequential or rudimentary. '
> u'unlearned; ignorant. '
> u' '
> u'unsophisticated; naive; credulous. '
> u' '
> u'not mixed. '
> u' '
> u'not mixed. '
> u' '
> u' '
> u' '
> u' '
> u'). '
> u' '
> u'(of a lens) having two optical surfaces only. '
> u'an ignorant, foolish, or gullible person. '
> u'something simple, unmixed, or uncompounded. '
> u'cords for controlling the warp threads in forming the shed on draw-looms. '
> u'a person of humble origins; commoner. '
> u' '
>
>
>
> However, the definitions are there. I printed the actual soup and they were
> there in the format they always were in. So what is the deal!?!
>
> >>> soup.findAll('table', {'class': 'luna-Ent'})
>
> [<table class="luna-Ent"><tr><td valign="top" class="dn">1.</td><td
> valign="top">easy to understand, deal with, use, etc.: a simple matter;
> simple tools.  </td></tr></table>
>
> See there is the first one in the shell, I mean it is there, but the for
> loop can't find it. I am wondering, because the above
> soup.findAll('table'..etc. makes it a list. Do you think that has anything
> to do with the problem?
>
> Alexnb wrote:
>
> > Actually after looking at this, the code is preactically the same, except
> > the definitions. So what COULD be going wrong here?
>
> > Also, I ran the program and decided to print the whole list of definitions
> > straight off BeautifulSoup, and I got an interesting result:
>
> > What word would you like to define: simple
> > [u' ', u' ', u' ', u' ', u' ', u'not complex or compound; single.
>
> > those are the first 5 definitions. and later on, it does the same thing.
> > it only sees a space, any ideas?
>
> > Alexnb wrote:
>
> >> Okay, so i've hit a new snag and can't seem to figure out what is wrong.
> >> What is happening is the first 4 definitions of the word "simple" don't
> >> show up. The html is basicly the same, with the exception of noun turning
> >> into adj. Ill paste the html of the word cheese, and then the one for
> >> simple, and the code I am using to do the work.
>
> >> line of html for the 2nd def of cheese:
>
> >> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
> >> valign="top">a definite mass of this substance, often in the shape of a
> >> wheel or cylinder. </td></tr></table>
>
> >> line of html for the 2nd def of simple:
>
> >> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
> >> valign="top">not elaborate or artificial; plain: a simple style.
> >> </td></tr></table>
>
> >> code:
>
> >> import urllib
> >> from BeautifulSoup import BeautifulSoup
>
> >> def get_defs(term):
> >>     soup = BeautifulSoup(urllib.urlopen(
> >>         'http://dictionary.reference.com/search?q=%s' % term))
>
> >>     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
> >>         yield tabs.findAll('td')[-1].contents[-1].string
>
> >> word = raw_input("What word would you like to define: ")
>
> >> mainList = list(get_defs(word))
>
> >> n=0
> >> q = 1
>
> >> for x in mainList:
> >>     print str(q)+".  "+str(mainList[n])
> >>     q=q+1
> >>     n=n+1
>
> >> Now, I don't think it is the italics because one of the definitions that
> >> worked had them in it in the same format. Any Ideas??!
>
> >> Jeff McNeil-2 wrote:
>
> >>> On Jun 29, 12:50 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >>>> No I figured it out. I guess I never knew that you aren't supposed to
> >>>> split a
> >>>> url like "http://www.goo\
> >>>> gle.com" But I did and it gave me all those errors. Anyway, I had a
> >>>> question. On the original code you had this for loop:
>
> >>>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
> >>>>         yield tabs.findAll('td')[-1].contents[-1].string
>
> >>>> I hate to be a pain, but I was looking at the BeautifulSoup docs, and
> >>>> found
> >>>> the findAll thing. But I want to know why you put "for tabs," also why
> >>>> you
> >>>> need the "'table', {'class': 'luna-Ent'}):" Like why the curly braces
> >>>> and
> >>>> whatnot?
>
> >>>> Jeff McNeil-2 wrote:
>
> >>>> > On Jun 27, 10:26 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >>>> >> Okay, so I copied your code(and just so you know I am on a mac right
> >>>> now
> >>>> >> and
> >>>> >> i am using pydev in eclipse), and I got these errors, any idea what
> >>>> is
> >>>> >> up?
>
> >>>> >> Traceback (most recent call last):
> >>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 14, in <module>
> >>>> >>     print list(get_defs("cheese"))
> >>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 9, in get_defs
> >>>> >>     dictionary.reference.com/search?q=%s' % term))
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 82, in urlopen
> >>>> >>     return opener.open(url)
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
> >>>> >>     return getattr(self, name)(url)
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 325, in open_http
> >>>> >>     h.endheaders()
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 856, in endheaders
> >>>> >>     self._send_output()
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 728, in _send_output
> >>>> >>     self.send(msg)
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 695, in send
> >>>> >>     self.connect()
> >>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 663, in connect
> >>>> >>     socket.SOCK_STREAM):
> >>>> >> IOError: [Errno socket error] (8, 'nodename nor servname provided, or not known')
>
> >>>> >> Sorry if it is hard to read.
>
> >>>> >> Jeff McNeil-2 wrote:
>
> >>>> >> > Well, what about pulling that data out using Beautiful soup? If
> >>>> you
> >>>> >> > know the table name and whatnot, try something like this:
>
> >>>> >> > #!/usr/bin/python
>
> >>>> >> > import urllib
> >>>> >> > from BeautifulSoup import BeautifulSoup
>
> >>>> >> > def get_defs(term):
> >>>> >> >     soup = BeautifulSoup(urllib.urlopen('http://
> >>>> >> > dictionary.reference.com/search?q=%s' % term))
>
> >>>> >> >     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
> >>>> >> >         yield tabs.findAll('td')[-1].contents[-1].string
>
> >>>> >> > print list(get_defs("frog"))
>
> >>>> >> > jeff at martian:~$ python test.py
> >>>> >> > [u'any tailless, stout-bodied amphibian of the order Anura,
> >>>> including
> >>>> >> > the smooth, moist-skinned frog species that live in a damp or
> >>>> >> > semiaquatic habitat and the warty, drier-skinned toad species that
> >>>> are
> >>>> >> > mostly terrestrial as adults. ', u' ', u' ', u'a French person or
> >>>> a
> >>>> >> > person of French descent. ', u'a small holder made of heavy
> >>>> material,
> >>>> >> > placed in a bowl or vase to hold flower stems in position. ', u'a
> >>>> >> > recessed panel on one of the larger faces of a brick or the like.
> >>>> ',
> >>>> >> > u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ',
> >>>> u'an
> >>>> >> > ornamental fastening for the front of a coat, consisting of a
> >>>> button
> >>>> >> > and a loop through which it passes. ', u'a sheath suspended from a
> >>>> >> > belt and supporting a scabbard. ', u'a device at the intersection
> >>>> of
> >>>> >> > two tracks to permit the wheels and flanges on one track to cross
> >>>> or
> >>>> >> > branch from the other. ', u'a triangular mass of elastic, horny
> >>>> >> > substance in the middle of the sole of the foot of a horse or
> >>>> related
> >>>> >> > animal. ']
>
> >>>> >> > HTH,
>
> >>>> >> > Jeff
>
> >>>> >> > On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >>>> >> >> I have read that multiple times. It is hard to understand but it
> >>>> did
> >>>> >> help
> >>>> >> >> a
> >>>> >> >> little. But I found a bit of a work-around for now which is not
> >>>> what I
> >>>> >> >> ultimately want. However, even when I can get to the page I want
> >>>> lets
> >>>> >> >> say,
> >>>> >> >> "Http://dictionary.reference.com/browse/cheese", I look on
> >>>> firebug,
> >>>> >> and
> >>>> >> >> extension and see the definition in javascript,
>
> >>>> >> >> <table class="luna-Ent">
> >>>> >> >> <tbody>
> >>>> >> >> <tr>
> >>>> >> >> <td class="dn" valign="top">1.</td>
> >>>> >> >> <td valign="top">the curd of milk separated from the whey and
> >>>> prepared
> >>>> >> in
> >>>> >> >> many ways as a food. </td>
>
> >>>> >> >> Jeff McNeil-2 wrote:
>
> >>>> >> >> > the problem being that if I use code like this to get the html
> >>>> of
> >>>> >> that
>
> >>>> >> >> > page in python:
>
> >>>> >> >> > response = urllib2.urlopen("the webiste....")
> >>>> >> >> > html = response.read()
> >>>> >> >> > print html
>
> >>>> >> >> > then, I get a bunch of stuff, but it doesn't show me the code
> >>>> with
> >>>> >> the
> >>>> >> >> > table that the definition is in. So I am asking how do I access
> >>>> this
> >>>> >> >> > javascript. Also, if someone could point me to a better
> >>>> reference
> >>>> >> than
> >>>> >> >> the
> >>>> >> >> > last one, because that really doesn't tell me much, whether it
> >>>> be a
> >>>> >> >> book
> >>>> >> >> > or anything.
>
> >>>> >> >> > I stumbled across this a while back:
> >>>> >> >> >http://www.voidspace.org.uk/python/articles/urllib2.shtml.
> >>>> >> >> > It covers quite a bit. The urllib2 module is pretty
> >>>> straightforward
> >>>> >> >> > once you've used it a few times.  Some of the class naming and
> >>>> >> whatnot
> >>>> >> >> > takes a bit of getting used to (I...
>




More information about the Python-list mailing list