using urllib2

Alexnb alexnbryan at gmail.com
Sun Jun 29 14:59:25 EDT 2008


Actually, after looking at this, the code is practically the same except for
the definitions. So what COULD be going wrong here?
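
One likely culprit: .contents[-1].string only works cleanly when the last child
of the <td> is plain text. If the cell ends with a nested tag (an italicized
example phrase, say) or a stray whitespace node, you get None, a fragment, or
an empty string, and the definition prints as blank. A minimal sketch of a more
forgiving extraction, reusing the same BeautifulSoup 3 calls as the code quoted
below (untested):

import urllib
from BeautifulSoup import BeautifulSoup

def get_defs(term):
    # Same fetch-and-parse step as in the quoted code.
    soup = BeautifulSoup(urllib.urlopen(
        'http://dictionary.reference.com/search?q=%s' % term))
    for tab in soup.findAll('table', {'class': 'luna-Ent'}):
        # Join every text node under the last <td> instead of relying on
        # .string, which gives None whenever the cell holds mixed content.
        last_td = tab.findAll('td')[-1]
        yield ''.join(last_td.findAll(text=True)).strip()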

Alexnb wrote:
> 
> Okay, so I've hit a new snag and can't seem to figure out what is wrong:
> the first four definitions of the word "simple" don't show up. The HTML is
> basically the same, except that "noun" turns into "adj." I'll paste the HTML
> for the word "cheese," then the one for "simple," and then the code I am
> using to do the work.
> 
> Line of HTML for the 2nd definition of "cheese":
> 
> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
> valign="top">a definite mass of this substance, often in the shape of a
> wheel or cylinder. </td></tr></table>
> 
> Line of HTML for the 2nd definition of "simple":
> 
> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
> valign="top">not elaborate or artificial; plain: a simple style. 
> </td></tr></table>
> 
> code:
> 
> import urllib
> from BeautifulSoup import BeautifulSoup
> 
> 
> def get_defs(term):
>     soup = BeautifulSoup(urllib.urlopen(
>         'http://dictionary.reference.com/search?q=%s' % term))
> 
>     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>         yield tabs.findAll('td')[-1].contents[-1].string
> 
> word = raw_input("What word would you like to define: ")
> 
> mainList = list(get_defs(word))
> 
> for n, definition in enumerate(mainList):
>     print str(n + 1) + ".  " + str(definition)
> 
> Now, I don't think it is the italics, because one of the definitions that
> worked had them in the same format. Any ideas?
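> In case it helps, here is a quick check to see exactly what BeautifulSoup
> finds in each cell (just a debugging sketch):
>
> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>     # .contents lists every child of the last <td>, tags included, which
>     # shows at a glance when .string would come back empty or None.
>     print tabs.findAll('td')[-1].contents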
> 
> 
> Jeff McNeil-2 wrote:
>> 
>> On Jun 29, 12:50 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>> No, I figured it out. I guess I never knew that you aren't supposed to
>>> split a URL across lines like "http://www.goo\
>>> gle.com", but I did, and it gave me all those errors. Anyway, I had a
>>> question. In the original code you had this for loop:
>>>
>>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>>         yield tabs.findAll('td')[-1].contents[-1].string
>>>
>>> I hate to be a pain, but I was looking at the BeautifulSoup docs and found
>>> the findAll method. I want to know why you put "for tabs", and also why you
>>> need the "'table', {'class': 'luna-Ent'}" part. Why the curly braces and
>>> whatnot?
>>>
>>> Jeff McNeil-2 wrote:
>>>
>>> > On Jun 27, 10:26 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>> >> Okay, so I copied your code (and just so you know, I am on a Mac right
>>> >> now and I am using PyDev in Eclipse), and I got these errors. Any idea
>>> >> what is up?
>>>
>>> >> Traceback (most recent call last):
>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 14, in <module>
>>> >>     print list(get_defs("cheese"))
>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 9, in get_defs
>>> >>     dictionary.reference.com/search?q=%s' % term))
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 82, in urlopen
>>> >>     return opener.open(url)
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
>>> >>     return getattr(self, name)(url)
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 325, in open_http
>>> >>     h.endheaders()
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 856, in endheaders
>>> >>     self._send_output()
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 728, in _send_output
>>> >>     self.send(msg)
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 695, in send
>>> >>     self.connect()
>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 663, in connect
>>> >>     socket.SOCK_STREAM):
>>> >> IOError: [Errno socket error] (8, 'nodename nor servname provided, or not known')
>>>
>>> >> Sorry if it is hard to read.
>>>
>>> >> Jeff McNeil-2 wrote:
>>>
>>> >> > Well, what about pulling that data out using Beautiful Soup? If you
>>> >> > know the table name and whatnot, try something like this:
>>>
>>> >> > #!/usr/bin/python
>>>
>>> >> > import urllib
>>> >> > from BeautifulSoup import BeautifulSoup
>>>
>>> >> > def get_defs(term):
>>> >> >     soup = BeautifulSoup(urllib.urlopen(
>>> >> >         'http://dictionary.reference.com/search?q=%s' % term))
>>>
>>> >> >     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>> >> >         yield tabs.findAll('td')[-1].contents[-1].string
>>>
>>> >> > print list(get_defs("frog"))
>>>
>>> >> > jeff at martian:~$ python test.py
>>> >> > [u'any tailless, stout-bodied amphibian of the order Anura, including
>>> >> > the smooth, moist-skinned frog species that live in a damp or
>>> >> > semiaquatic habitat and the warty, drier-skinned toad species that are
>>> >> > mostly terrestrial as adults. ', u' ', u' ', u'a French person or a
>>> >> > person of French descent. ', u'a small holder made of heavy material,
>>> >> > placed in a bowl or vase to hold flower stems in position. ', u'a
>>> >> > recessed panel on one of the larger faces of a brick or the like. ',
>>> >> > u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ', u'an
>>> >> > ornamental fastening for the front of a coat, consisting of a button
>>> >> > and a loop through which it passes. ', u'a sheath suspended from a
>>> >> > belt and supporting a scabbard. ', u'a device at the intersection of
>>> >> > two tracks to permit the wheels and flanges on one track to cross or
>>> >> > branch from the other. ', u'a triangular mass of elastic, horny
>>> >> > substance in the middle of the sole of the foot of a horse or related
>>> >> > animal. ']
>>>
>>> >> > HTH,
>>>
>>> >> > Jeff
>>>
>>> >> > On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>> >> >> I have read that multiple times. It is hard to understand, but it
>>> >> >> did help a little. I found a bit of a work-around for now, which is
>>> >> >> not what I ultimately want. However, even when I can get to the page
>>> >> >> I want, let's say "http://dictionary.reference.com/browse/cheese", I
>>> >> >> look in Firebug, an extension, and see the definition in JavaScript:
>>>
>>> >> >> <table class="luna-Ent">
>>> >> >> <tbody>
>>> >> >> <tr>
>>> >> >> <td class="dn" valign="top">1.</td>
>>> >> >> <td valign="top">the curd of milk separated from the whey and
>>> >> >> prepared in many ways as a food. </td>
>>>
>>> >> >> Jeff McNeil-2 wrote:
>>>
>>> >> >> > the problem being that if I use code like this to get the html of
>>> >> >> > that page in python:
>>>
>>> >> >> > response = urllib2.urlopen("the website....")
>>> >> >> > html = response.read()
>>> >> >> > print html
>>>
>>> >> >> > then, I get a bunch of stuff, but it doesn't show me the code with
>>> >> >> > the table that the definition is in. So I am asking how do I access
>>> >> >> > this javascript. Also, if someone could point me to a better
>>> >> >> > reference than the last one, because that really doesn't tell me
>>> >> >> > much, whether it be a book or anything.
>>>
>>> >> >> > I stumbled across this a while back:
>>> >> >> > http://www.voidspace.org.uk/python/articles/urllib2.shtml.
>>> >> >> > It covers quite a bit. The urllib2 module is pretty straightforward
>>> >> >> > once you've used it a few times.  Some of the class naming and
>>> >> >> > whatnot takes a bit of getting used to (I found that to be the most
>>> >> >> > confusing bit).
>>>
>>> >> >> > On Jun 27, 1:41 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>> >> >> >> Okay, I tried to follow that, and it is kinda hard. But since you
>>> >> >> >> obviously know what you are doing, where did you learn this? Or
>>> >> >> >> where can I learn this?
>>>
>>> >> >> >> Maric Michaud wrote:
>>>
>>> >> >> >> > On Friday 27 June 2008 10:43:06, Alexnb wrote:
>>> >> >> >> >> I have never used urllib or urllib2. I really have looked
>>> >> >> >> >> online and on mailing lists for help on this issue, but I can't
>>> >> >> >> >> figure out my problem because people haven't been helping me,
>>> >> >> >> >> which is why I am here! :]
>>> >> >> >> >> Okay, so basically I want to be able to submit a word to
>>> >> >> >> >> dictionary.com and then get the definitions. However, to start
>>> >> >> >> >> off learning urllib2, I just want to do a simple google search.
>>> >> >> >> >> Before you get mad, what I have found on urllib2 hasn't helped
>>> >> >> >> >> me. Anyway, how would you go about doing this? No, I did not
>>> >> >> >> >> post the html, but if you want, right-click in your browser and
>>> >> >> >> >> hit "view source" on the google homepage. Basically, what I want
>>> >> >> >> >> to know is how to submit the values (the search term) and then
>>> >> >> >> >> search for that value. Here's what I know:
>>>
>>> >> >> >> >> import urllib2
>>> >> >> >> >> response = urllib2.urlopen("http://www.google.com/")
>>> >> >> >> >> html = response.read()
>>> >> >> >> >> print html
>>>
>>> >> >> >> >> Now I know that all this does is print the source, but that's
>>> >> >> >> >> about all I know. I know it may be a lot to ask to have someone
>>> >> >> >> >> show/help me, but I really would appreciate it.
>>>
>>> >> >> >> > This example is for google; of course, using pygoogle is easier
>>> >> >> >> > in this case, but it is a valid example for the general case:
>>>
>>> >> >> >> >>>>[207]: import urllib, urllib2
>>>
>>> >> >> >> > You need to trick the server with an imaginary User-Agent.
>>>
>>> >> >> >> >>>>[208]: def google_search(terms) :
>>> >> >> >> >     return urllib2.urlopen(
>>> >> >> >> >         urllib2.Request("http://www.google.com/search?"
>>> >> >> >> >                         + urllib.urlencode({'hl':'fr', 'q':terms}),
>>> >> >> >> >                         headers={'User-Agent':'MyNav 1.0 (compatible; MSIE 6.0; Linux'})
>>> >> >> >> >         ).read()
>>> >> >> >> >    .....:
>>>
>>> >> >> >> >>>>[212]: res = google_search("python & co")
>>>
>>> >> >> >> > Now you've got the whole html response; you'll have to parse it
>>> >> >> >> > to recover the data. A quick & dirty try on the google response
>>> >> >> >> > page:
>>>
>>> >> >> >> >>>>[213]: import re
>>>
>>> >> >> >> >>>>[214]: [ re.sub('<.+?>', '', e) for e in re.findall('<h2 class=r>.*?</h2>', res) ]
>>> >> >> >> > ...[229]:
>>> >> >> >> > ['Python Gallery',
>>> >> >> >> >  'Coffret Monty Python And Co 3 DVD : La Premi\xe8re folie des Monty ...',
>>> >> >> >> >  'Re: os x, panther, python & co: msg#00041',
>>> >> >> >> >  'Re: os x, panther, python & co: msg#00040',
>>> >> >> >> >  'Cardiff Web Site Design, Professional web site design services ...',
>>> >> >> >> >  'Python Properties',
>>> >> >> >> >  'Frees < Programs < Python < Bin-Co',
>>> >> >> >> >  'Torb: an interface between Tcl and CORBA',
>>> >> >> >> >  'Royal Python Morphs',
>>> >> >> >> >  'Python & Co']
>>>
>> 
>> The definitions were embedded in tables with a 'luna-Ent' class. I pulled
>> all of the tables with that class out, and then returned the string value of
>> the td containing the actual definition. The findAll method takes an
>> optional dictionary of attributes to match, hence the {}.
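>> For example, these two calls do the same thing in BeautifulSoup 3; since
>> "class" is a reserved word in Python it can't be passed as a keyword
>> argument, so it goes inside the attribute dictionary (a rough sketch of
>> both spellings):
>>
>> # Positional attrs dictionary, exactly as in the snippet above.
>> tables = soup.findAll('table', {'class': 'luna-Ent'})
>>
>> # The same query with the attrs parameter named explicitly.
>> tables = soup.findAll('table', attrs={'class': 'luna-Ent'})
>>
>> The "for tabs in ..." loop then just walks over each matching table, one
>> definition entry at a time.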
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/using-urllib2-tp18150669p18184170.html
Sent from the Python - python-list mailing list archive at Nabble.com.



