using urllib2

Alexnb alexnbryan at gmail.com
Sun Jun 29 20:52:25 CEST 2008


Okay, so I've hit a new snag and can't figure out what is wrong. The first 4
definitions of the word "simple" don't show up. The HTML is basically the
same, with the exception of noun turning into adj. I'll paste the HTML for
the word cheese, then the one for simple, and the code I am using to do the
work.

line of html for the 2nd def of cheese:

<table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
valign="top">a definite mass of this substance, often in the shape of a
wheel or cylinder. </td></tr></table>

line of html for the 2nd def of simple:

<table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
valign="top">not elaborate or artificial; plain: a simple style. 
</td></tr></table>

code:

import urllib
from BeautifulSoup import BeautifulSoup


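def get_defs(term):
    # Fetch the search results page and parse it.
    soup = BeautifulSoup(urllib.urlopen(
        'http://dictionary.reference.com/search?q=%s' % term))

    # Each definition sits in its own <table class="luna-Ent">.
    for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
        # Take the last child of the last <td> as the definition text.
        yield tabs.findAll('td')[-1].contents[-1].string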

word = raw_input("What word would you like to define: ")

mainList = list(get_defs(word))

# Print each definition with its 1-based number.
for n, definition in enumerate(mainList):
    print "%d.  %s" % (n + 1, definition)

Now, I don't think it is the italics, because one of the definitions that
worked had them in the same format. Any ideas?
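
One possible cause, stated as an assumption because the HTML pasted above
shows no inner tags: if the live page wraps part of a definition in an inline
tag, then contents[-1] is only the last child of the cell (a tag, or a stray
whitespace string), not the whole definition. In BeautifulSoup 3 the usual way
around that is to join every text node in the cell, for example with
findAll(text=True). The sketch below shows the same idea using only the
Python 3 standard library (html.parser instead of BeautifulSoup) so that it
runs without extra installs. The LastCellText class and the
<span class="ital-inline"> markup are hypothetical stand-ins, not taken from
the real page.

```python
from html.parser import HTMLParser


class LastCellText(HTMLParser):
    """Collect the full text of the last <td> in each <table class="luna-Ent">."""

    def __init__(self):
        super().__init__()
        self.defs = []         # one cleaned-up string per matching table
        self.in_table = False  # inside a luna-Ent table?
        self.in_td = False     # inside a <td> of such a table?
        self.cell = []         # text pieces of the <td> currently open
        self.last = ""         # text of the most recently closed <td>

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("class") == "luna-Ent":
            self.in_table = True
            self.last = ""
        elif tag == "td" and self.in_table:
            self.in_td = True
            self.cell = []     # a new cell replaces the previous one

    def handle_data(self, data):
        # Text inside nested tags (span, i, ...) still arrives here.
        if self.in_td:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            self.in_td = False
            # Join all pieces and collapse runs of whitespace.
            self.last = " ".join("".join(self.cell).split())
        elif tag == "table" and self.in_table:
            self.in_table = False
            self.defs.append(self.last)


# Hypothetical markup: the example phrase is wrapped in an inline tag.
html = (
    '<table class="luna-Ent"><tr><td valign="top" class="dn">2.</td>'
    '<td valign="top">not elaborate or artificial; plain: '
    '<span class="ital-inline">a simple style.</span> </td></tr></table>'
)

parser = LastCellText()
parser.feed(html)
print(parser.defs)
```

With this markup, the last child of the cell is a lone space after the span,
which is consistent with the u' ' entries in the "frog" output quoted below.
Collecting every text piece in the cell avoids depending on which child
happens to come last.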


Jeff McNeil-2 wrote:
> 
> On Jun 29, 12:50 pm, Alexnb <alexnbr... at gmail.com> wrote:
>> No I figured it out. I guess I never knew that you aren't supposed to
>> split a
>> url like "http://www.goo\
>> gle.com" But I did and it gave me all those errors. Anyway, I had a
>> question. On the original code you had this for loop:
>>
>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>         yield tabs.findAll('td')[-1].contents[-1].string
>>
>> I hate to be a pain, but I was looking at the BeautifulSoup docs, and
>> found
>> the findAll thing. But I want to know why you put "for tabs," also why
>> you
>> need the "'table', {'class': 'luna-Ent'}):" Like why the curly braces and
>> whatnot?
>>
>> Jeff McNeil-2 wrote:
>>
>> > On Jun 27, 10:26 pm, Alexnb <alexnbr... at gmail.com> wrote:
>> >> Okay, so I copied your code(and just so you know I am on a mac right
>> now
>> >> and
>> >> i am using pydev in eclipse), and I got these errors, any idea what is
>> >> up?
>>
>> >> Traceback (most recent call last):
>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 14, in <module>
>> >>     print list(get_defs("cheese"))
>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 9, in get_defs
>> >>     dictionary.reference.com/search?q=%s' % term))
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 82, in urlopen
>> >>     return opener.open(url)
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
>> >>     return getattr(self, name)(url)
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 325, in open_http
>> >>     h.endheaders()
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 856, in endheaders
>> >>     self._send_output()
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 728, in _send_output
>> >>     self.send(msg)
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 695, in send
>> >>     self.connect()
>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 663, in connect
>> >>     socket.SOCK_STREAM):
>> >> IOError: [Errno socket error] (8, 'nodename nor servname provided, or not known')
>>
>> >> Sorry if it is hard to read.
>>
>> >> Jeff McNeil-2 wrote:
>>
>> >> > Well, what about pulling that data out using Beautiful soup? If you
>> >> > know the table name and whatnot, try something like this:
>>
>> >> > #!/usr/bin/python
>>
>> >> > import urllib
>> >> > from BeautifulSoup import BeautifulSoup
>>
>> >> > def get_defs(term):
>> >> >     soup = BeautifulSoup(urllib.urlopen('http://
>> >> > dictionary.reference.com/search?q=%s' % term))
>>
>> >> >     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>> >> >         yield tabs.findAll('td')[-1].contents[-1].string
>>
>> >> > print list(get_defs("frog"))
>>
>> >> > jeff at martian:~$ python test.py
>> >> > [u'any tailless, stout-bodied amphibian of the order Anura, including
>> >> > the smooth, moist-skinned frog species that live in a damp or
>> >> > semiaquatic habitat and the warty, drier-skinned toad species that are
>> >> > mostly terrestrial as adults. ', u' ', u' ', u'a French person or a
>> >> > person of French descent. ', u'a small holder made of heavy material,
>> >> > placed in a bowl or vase to hold flower stems in position. ', u'a
>> >> > recessed panel on one of the larger faces of a brick or the like. ',
>> >> > u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ', u'an
>> >> > ornamental fastening for the front of a coat, consisting of a button
>> >> > and a loop through which it passes. ', u'a sheath suspended from a
>> >> > belt and supporting a scabbard. ', u'a device at the intersection of
>> >> > two tracks to permit the wheels and flanges on one track to cross or
>> >> > branch from the other. ', u'a triangular mass of elastic, horny
>> >> > substance in the middle of the sole of the foot of a horse or related
>> >> > animal. ']
>>
>> >> > HTH,
>>
>> >> > Jeff
>>
>> >> > On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
>> >> >> I have read that multiple times. It is hard to understand but it
>> did
>> >> help
>> >> >> a
>> >> >> little. But I found a bit of a work-around for now which is not
>> what I
>> >> >> ultimately want. However, even when I can get to the page I want
>> lets
>> >> >> say,
>> >> >> "Http://dictionary.reference.com/browse/cheese", I look on firebug,
>> >> and
>> >> >> extension and see the definition in javascript,
>>
>> >> >> <table class="luna-Ent">
>> >> >> <tbody>
>> >> >> <tr>
>> >> >> <td class="dn" valign="top">1.</td>
>> >> >> <td valign="top">the curd of milk separated from the whey and
>> prepared
>> >> in
>> >> >> many ways as a food. </td>
>>
>> >> >> Jeff McNeil-2 wrote:
>>
>> >> >> > the problem being that if I use code like this to get the html of
>> >> that
>>
>> >> >> > page in python:
>>
>> >> >> > response = urllib2.urlopen("the webiste....")
>> >> >> > html = response.read()
>> >> >> > print html
>>
>> >> >> > then, I get a bunch of stuff, but it doesn't show me the code
>> with
>> >> the
>> >> >> > table that the definition is in. So I am asking how do I access
>> this
>> >> >> > javascript. Also, if someone could point me to a better reference
>> >> than
>> >> >> the
>> >> >> > last one, because that really doesn't tell me much, whether it be
>> a
>> >> >> book
>> >> >> > or anything.
>>
>> >> >> > I stumbled across this a while back:
>> >> >> >http://www.voidspace.org.uk/python/articles/urllib2.shtml.
>> >> >> > It covers quite a bit. The urllib2 module is pretty
>> straightforward
>> >> >> > once you've used it a few times.  Some of the class naming and
>> >> whatnot
>> >> >> > takes a bit of getting used to (I found that to be the most
>> >> confusing
>> >> >> > bit).
>>
>> >> >> > On Jun 27, 1:41 pm, Alexnb <alexnbr... at gmail.com> wrote:
>> >> >> >> Okay, I tried to follow that, and it is kinda hard. But since
>> you
>> >> >> >> obviously
>> >> >> >> know what you are doing, where did you learn this? Or where can
>> I
>> >> >> learn
>> >> >> >> this?
>>
>> >> >> >> Maric Michaud wrote:
>>
>> >> >> >> > Le Friday 27 June 2008 10:43:06 Alexnb, vous avez écrit :
>> >> >> >> >> I have never used the urllib or the urllib2. I really have
>> >> looked
>> >> >> >> online
>> >> >> >> >> for help on this issue, and mailing lists, but I can't figure
>> >> out
>> >> >> my
>> >> >> >> >> problem because people haven't been helping me, which is why
>> I
>> >> am
>> >> >> >> here!
>> >> >> >> >> :].
>> >> >> >> >> Okay, so basically I want to be able to submit a word to
>> >> >> >> dictionary.com
>> >> >> >> >> and
>> >> >> >> >> then get the definitions. However, to start off learning
>> >> urllib2, I
>> >> >> >> just
>> >> >> >> >> want to do a simple google search. Before you get mad, what I
>> >> have
>> >> >> >> found
>> >> >> >> >> on
>> >> >> >> >> urllib2 hasn't helped me. Anyway, How would you go about
>> doing
>> >> >> this.
>> >> >> >> No,
>> >> >> >> >> I
>> >> >> >> >> did not post the html, but I mean if you want, right click on
>> >> your
>> >> >> >> >> browser
>> >> >> >> >> and hit view source of the google homepage. Basically what I
>> >> want
>> >> >> to
>> >> >> >> know
>> >> >> >> >> is how to submit the values(the search term) and then search
>> for
>> >> >> that
>> >> >> >> >> value. Heres what I know:
>>
>> >> >> >> >> import urllib2
>> >> >> >> >> response = urllib2.urlopen("http://www.google.com/")
>> >> >> >> >> html = response.read()
>> >> >> >> >> print html
>>
>> >> >> >> >> Now I know that all this does is print the source, but thats
>> >> about
>> >> >> all
>> >> >> >> I
>> >> >> >> >> know. I know it may be a lot to ask to have someone show/help
>> >> me,
>> >> >> but
>> >> >> >> I
>> >> >> >> >> really would appreciate it.
>>
>> >> >> >> > This example is for google, of course using pygoogle is easier
>> in
>> >> >> this
>> >> >> >> > case,
>> >> >> >> > but this is a valid example for the general case :
>>
>> >> >> >> >>>>[207]: import urllib, urllib2
>>
>> >> >> >> > You need to trick the server with an imaginary User-Agent.
>>
>> >> >> >> >>>>[208]: def google_search(terms) :
>> >> >> >> >     return urllib2.urlopen(urllib2.Request(
>> >> >> >> >         "http://www.google.com/search?" + urllib.urlencode({'hl':'fr', 'q':terms}),
>> >> >> >> >         headers={'User-Agent':'MyNav 1.0 (compatible; MSIE 6.0; Linux'})
>> >> >> >> >         ).read()
>> >> >> >> >    .....:
>>
>> >> >> >> >>>>[212]: res = google_search("python & co")
>>
>> >> >> >> > Now you got the whole html response, you'll have to parse it
>> to
>> >> >> recover
>> >> >> >> > datas,
>> >> >> >> > a quick & dirty try on google response page :
>>
>> >> >> >> >>>>[213]: import re
>>
>> >> >> >> >>>>[214]: [ re.sub('<.+?>', '', e) for e in re.findall('<h2 class=r>.*?</h2>', res) ]
>> >> >> >> > ...[229]:
>> >> >> >> > ['Python Gallery',
>> >> >> >> >  'Coffret Monty Python And Co 3 DVD : La Premi\xe8re folie des Monty ...',
>> >> >> >> >  'Re: os x, panther, python &amp; co: msg#00041',
>> >> >> >> >  'Re: os x, panther, python &amp; co: msg#00040',
>> >> >> >> >  'Cardiff Web Site Design, Professional web site design services ...',
>> >> >> >> >  'Python Properties',
>> >> >> >> >  'Frees &lt; Programs &lt; Python &lt; Bin-Co',
>> >> >> >> >  'Torb: an interface between Tcl and CORBA',
>> >> >> >> >  'Royal Python Morphs',
>> >> >> >> >  'Python &amp; Co']
>>
>> >> >> >> > --
>> >> >> >> > _____________
>>
>> >> >> >> > Maric Michaud
>> >> >> >> > --
>> >> >> >> >http://mail.python.org/mailman/listinfo/python-list
>>
>> >> >> >> --
>> >> >> >> View this message in
>>
>> >> context:http://www.nabble.com/using-urllib2-tp18150669p18160312.html
>> >> >> >> Sent from the Python - python-list mailing list archive at
>> >> Nabble.com.
>>
>> >> >> > --
>> >> >> >http://mail.python.org/mailman/listinfo/python-list
>>
>> >> >> --
>> >> >> View this message in
>> >> >>
>> context:http://www.nabble.com/using-urllib2-tp18150669p18165634.html
>> >> >> Sent from the Python - python-list mailing list archive at
>> Nabble.com.
>>
>> >> > --
>> >> >http://mail.python.org/mailman/listinfo/python-list
>>
>> >> --
>> >> View this message in...
>>
> 
> The definitions were embedded in tables with a 'luna-Ent' class.  I
> pulled all of the tables with that class out, and then returned the
> string value of td containing the actual definition. The findAll
> method takes an optional dictionary, thus the {}.
> --
> http://mail.python.org/mailman/listinfo/python-list
> 
> 
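
To make the role of the dictionary argument concrete, here is a rough sketch
of what an attribute filter like {'class': 'luna-Ent'} does: it keeps only the
tags whose attributes equal every key/value pair in the dictionary. This is
not BeautifulSoup's implementation. The find_all helper and the TagFinder
class are hypothetical names built on the Python 3 standard library, and the
"nav" table in the sample page is invented for the example.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Record the attributes of every start tag matching a name and a filter."""

    def __init__(self, name, wanted):
        super().__init__()
        self.name = name      # tag name to look for, e.g. 'table'
        self.wanted = wanted  # dict of attribute name -> required value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        # Keep the tag only if every requested attribute has the exact value.
        if tag == self.name and all(
            found.get(key) == value for key, value in self.wanted.items()
        ):
            self.matches.append(found)


def find_all(html, name, attrs):
    """Return the attribute dicts of all <name> tags that satisfy attrs."""
    finder = TagFinder(name, attrs)
    finder.feed(html)
    return finder.matches


# Made-up page: two definition tables and one unrelated table.
page = (
    '<table class="luna-Ent"><tr><td>1.</td></tr></table>'
    '<table class="nav"><tr><td>menu</td></tr></table>'
    '<table class="luna-Ent"><tr><td>2.</td></tr></table>'
)

matches = find_all(page, 'table', {'class': 'luna-Ent'})
print(len(matches))
```

An empty dictionary places no constraint on attributes, so
find_all(page, 'table', {}) matches all three tables. That is why the
dictionary is optional in the call quoted above.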

-- 
View this message in context: http://www.nabble.com/using-urllib2-tp18150669p18184087.html
Sent from the Python - python-list mailing list archive at Nabble.com.



